Self-Grading and Peer-Grading for Formative and Summative Assessments
in 3rd Through 12th Grade Classrooms: A Meta-Analysis

Carmen  E.  Sanchez,  Kayla  M.  Atkinson,  Alison  C.  Koenka,  Hannah  Moshontz,  and  Harris  Cooper 

Duke  University 


The  assessment  for  learning  movement  in  education  has  increased  attention  to  self-grading  and 
peer-grading  practices  in  primary  and  secondary  schools.  This  research  synthesis  examined  several 
questions  pertaining  to  the  use  of  self-grading  and  peer-grading  in  conjunction  with  criterion-referenced 
testing in 3rd- through 12th-grade-level classrooms. We investigated (a) the effects of students’ participation
in grading on subsequent test performance, (b) the difference between grades when assigned by
students  or  teachers,  and  (c)  the  correlation  between  grades  assigned  by  students  and  teachers.  Students 
who engaged in self-grading performed better (g = .34) on subsequent tests than did students who did
not. Moderator analyses suggested that the estimated benefits of self-grading were greater when the
study controlled for group differences through random assignment. Students who engaged in peer-grading
performed better on subsequent tests than did students who did not (g = .29). On average,
students did not grade themselves or peers significantly differently than teachers (self-grades, g = .04;
peer-grades, g = .04) and showed moderate correlation (self-grading, r = .67; peer-grading, r = .68) with
teacher  grades.  Further,  other  moderator  analyses  and  examination  of  studies  suggested  that  self-  and 
peer-grading  practices  can  be  implemented  to  positive  effect  in  primary  and  secondary  schools  with  the 
use  of  rubrics  and  training  for  students  in  a  formative  assessment  environment.  However,  because  of  a 
limited  number  of  studies,  these  mediating  variables  need  more  research  to  allow  more  conclusive 
findings. 

Keywords:  student  grading,  self-grading,  peer-grading,  meta-analysis,  primary  and  secondary  education 
Supplemental materials: http://dx.doi.org/10.1037/edu0000190.supp


Recent  educational  reform  has  emphasized  a  participatory  and 
collaborative  culture  of  learning  in  the  classroom.  Consequently, 
the  popularity  of  self-grading  and  peer-grading  (SPG)  in  primary 
through 12th grade classrooms has increased (Hovardas, Tsivitanidou,
& Zacharia, 2014) and, in some instances, become part of
school  culture  (Berger,  Rugen,  &  Woodfin,  2014).  As  receiving 
feedback  on  academic  work  is  an  established  mechanism  through 
which  students  learn  and  achieve  (Bangert-Drowns,  Kulik,  Kulik, 
&  Morgan,  1991;  Butler  &  Winne,  1995),  students  themselves  can 
serve as useful sources of feedback via SPG.

SPG involves students making judgments about their own
and others’ academic performance. They evaluate the extent to
which performance criteria and standards have been met (Boud,
1991) and provide criterion-referenced feedback, that is, grading,
to themselves or others. Although SPG in college classrooms
has received much attention in the research literature
(Atkinson, Sanchez, Koenka, Moshontz, & Cooper, 2016; Falchikov
& Boud, 1989; Falchikov & Goldfinch, 2000), less
attention has been paid to its implementation in primary and
secondary education. The current research synthesis aims to
integrate research involving (a) the effects of SPG on subsequent
student performance, and (b) the correspondence between
student and teacher grades in primary and secondary school
classrooms, in particular, differences in average grades given
on the same assessment and their distributional similarity (correlation).

Rather  than  simply  measuring  student  outcomes,  educational 
goals  now  target  evaluative  processes  that  also  can  improve 
student  performance  (Klenowski,  1995).  Thus,  the  “assessment 
of  learning”  paradigm  expanded  to  include  “assessment  for 
learning”  (Tillema,  Leenknecht,  &  Segers,  2011).  In  the  latter, 
students  are  active  participants  who  share  responsibility  and 
collaborate  with  the  teacher  in  the  assessment  process  (Dochy, 
Segers,  &  Sluijsmans,  1999).  The  increased  popularity  of  SPG 
accompanied  this  change  in  focus.  SPG  embodies  assessment 
for  learning  in  that  it  requires  students  to  engage  in  higher  level 
thinking  and  disciplined  inquiry  to  review,  clarify,  and  correct 
one’s own or others’ work. Additionally, as students apply
assessment  criteria,  they  develop  a  clearer  conception  of  the 
assessed  material  because  of  increased  exposure  to  it  (Hovardas 
et  al.,  2014;  Ross,  2006). 

Theoretical  Justifications  for  SPG 

The  potential  advantages  of  SPG  to  students  can  be  gleaned 
from numerous theoretical rationales. These can be grouped according
to increases in metacognition, motivation, and transferable
skills (Sadler & Good, 2006; Topping, 1998). Metacognition,
put  simply,  refers  to  “thinking  about  thinking,”  but  more  broadly 
refers  to  the  role  of  executive  processes  in  the  overseeing  and 
regulation  of  cognitive  processes  (Flavell,  1979).  Metacognition 
can apply to a person’s declarative knowledge of his or her own
learning processes (i.e., metacognitive knowledge) and/or to
strategies for regulating cognitive activities (i.e., metacognitive
skills). Metacognition has been suggested to be the most
powerful  predictor  of  learning  and  implies  that  a  child  has  control 
and  knowledge  over  his  or  her  own  thinking  and  learning  activities 
(Wang,  Haertel,  &  Walberg,  1990).  In  terms  of  metacognitive 
benefits,  SPG  processes  require  students  to  make  judgments  about 
their  own  and  others’  work,  and,  as  a  result,  can  lead  to  increased 
awareness, insight, and reasoning. In particular, self-grading encourages
a growth mind-set through an emphasis on revision and
progress  toward  a  higher  standard  of  achievement  (Andrade  & 
Valtcheva,  2009;  Dweck,  1986). 

Also,  SPG  provides  students  with  an  opportunity  to  become 
directly  involved  in  the  assessment  process,  resulting  in  a  greater 
sense of autonomy. According to self-determination theory, enhanced
autonomy should, in turn, predict heightened intrinsic
motivation,  or  a  desire  to  learn  for  its  own  sake  (Ryan  &  Deci, 
2000).  Classroom  assessment  theory  also  predicts  that  shared 
ownership  of  the  assessment  process  will  increase  effort  and 
achievement  through  students’  increased  perceived  control  and 
responsibility  for  learning  (Brookhart,  1997). 

Additionally,  SPG  can  encourage  the  development  of  critical 
thinking skills about students’ own work. Self-monitoring may
become internalized and habitual for students (Andrade &
Valtcheva, 2009). Finally, involvement in the assessment process
could decrease students’ cynicism about grading (Evans & Engelberg,
1988) by increasing their confidence that the grade “accurately”
reflects learning. When explicit grading criteria are used,
expectations for performance become more transparent and students
can more clearly understand how a grade was earned,
thereby  demystifying  the  grading  process  (Sadler  &  Good,  2006). 

In regard to transferable skills, SPG may increase communication
and collaboration skills, as well as the ability to evaluate
future  work  in  professional  or  academic  contexts.  Furthermore,  the 
use  of  SPG  can  decrease  teachers’  time  spent  grading  (Sadler  & 
Good,  2006;  Topping,  1998). 

Self-Grading  and  Peer-Grading 

Self-grading  and  peer-grading  are  discussed  together,  given  that 
they both pertain to students’ involvement in classroom assessment
practices. However, some important distinctions exist between
the two grading techniques (van Gennip, Segers, & Tillema,
2009). In the continuum of learning between formative assessment
and summative assessment, self-assessment lies closer to formative
assessment because it requires the active participation in the
judgment  of  students’  own  work  and  how  it  compares  with  the 
standard.  In  essence,  self-assessment  involves  self-regulation  and 
internalization  (Andrade  &  Du,  2007).  Furthermore,  self-grading 
assumes  a  growth  mind-set  by  empowering  the  students  to  make 
corrective changes on their work, thereby underscoring that learning
is incremental as opposed to just getting it or not (Dweck,
1986).  In  the  act  of  self-assessment,  students  are  invited  to  think 
about the quality of their own work instead of having someone else
evaluate  it  (Andrade  &  Valtcheva,  2009). 

In  comparison,  peer-grading  requires  that  a  student  actively 
participate in the judgment of another student’s work, thereby
making  peer-grading  a  fundamentally  interpersonal  process.  The 
interpersonal  nature  of  peer-grading  can  be  minimized  somewhat 
through anonymous and masked grading (Topping, 2003). Peer-grading
provides an opportunity for students to specify the quality
of  a  product  of  other  equal-status  students  (Topping,  2009)  and 
provides  another  opportunity  to  apply  what  they  have  learned 
(Dunning,  Heath,  &  Suls,  2004).  Peer-assessment  activities  can 
vary  across  numerous  dimensions,  in  that  (a)  assessors  can  be 
individuals  or  even  pairs  or  teams  of  students  and  (b)  the  direction 
of  the  assessment  can  be  one-way  or  reciprocal.  Importantly,  self- 
and  peer-grading  are  not  mutually  exclusive  and  offer  potential  for 
triangulation;  one  can  lead  to  the  other  and,  in  turn,  inform  the 
other. 

Another  important  consideration  when  comparing  SPG  is  that 
self-grading  is  inherently  open  to  a  self-serving  bias  (Dunning  et 
al.,  2004).  The  “flawed”  nature  of  self-grading  relates  to  students’ 
tendency  to  overrate  themselves  because  of  overconfidence  in 
newly  learned  skills  and  students’  poor  assessment  of  their  own 
comprehension  skills.  Peer-grading  is  not  as  hindered  by  students’ 
tendency  to  be  overconfident  in  their  own  abilities  and  provides  an 
opportunity  to  inform  students  of  shortcomings  of  which  they 
might  have  been  previously  unaware  (Dunning  et  al.,  2004). 

In  addition  to  whether  students  are  asked  to  grade  their  own 
paper  or  that  of  a  peer,  SPG  will  be  influenced  by  the  student’s 
developmental  level  as  well  as  teacher  training  and  attitudes.  Brief 
descriptions  of  these  issues  and  related  research  can  be  found  in 
supplemental  file  A  of  the  online  supplemental  materials. 

Implementation  of  SPG 

The  success  of  SPG  in  the  classroom  may  depend  on  a  number 
of  implementation  factors.  Ross  (2006)  offered  guidelines  on  how 
to successfully implement SPG in primary and secondary classrooms:
define the criteria (or rubric) by which students assess their
work; teach students how to apply the criteria; give students
feedback on their grading; and allow students to track their progress
to improve performance. Other researchers have added suggestions
to Ross’s guidelines, such as providing sufficient time for
revision  after  student  assessment  and  using  SPG  as  assessment  for 
learning,  rather  than  assessment  of  learning  (Brown  &  Harris, 
2013).  Of  note,  and  perhaps  not  surprisingly,  some  have  suggested 
that  students  are  better  able  to  self-grade  and  peer-grade  fact-based 
tests  than  tests  that  require  more  interpretation  and  reasoning  skills 
(Bonniol,  1981;  Dunning  et  al.,  2004). 

To  reduce  bias  and  produce  clear  and  meaningful  assessment 
tasks,  some  researchers  have  suggested  additional  environmental 
factors to consider when implementing SPG in the K-12 classroom.
These include students’ awareness of the value of SPG
(Goodrich, 1996); a supportive classroom environment that positively
influences a child’s likelihood to produce and report grading
results (Kuncel, Crede, & Thomas, 2005); open discussion between
students and teachers; and provision of qualitative feedback
(e.g., comments that inform future revisions; Hodgson, 2010).
Finally, Tillema et al. (2011) suggested that fairness and transparency
should be applied to all steps of the assessment cycle.

Two  important  components  of  SPG  in  the  classroom  are  the  use 
of  rubrics  and  student  training  to  provide  structure  and  guidance. 
Often,  the  student’s  SPG  process  is  marked  by  incremental  steps; 
first,  students  are  provided  a  clear  rubric  with  which  to  grade, 
followed  by  examples  of  how  to  grade,  and  then  students  practice 
grading. An ideal rubric provides a clear set of criteria and describes
varying levels of quality for a specific assignment (“specific”
rubric; Andrade & Valtcheva, 2009). Other types of rubrics
may provide some criteria reflecting the underlying skills and
knowledge within the defined domain, but ultimately leave the
grader to make an overall judgment on the quality of the work
(“general” rubric; Lane, 2012). Student training can include modeling,
direct instruction, and practice, which are commonly employed
classroom practices. For example, teachers may demonstrate
rubric use to the class, provide guidance while students
engage  in  the  SPG  process,  and  then  give  students  an  opportunity 
to  practice  SPG  independently. 

Previous  Reviews  of  SPG  Research 

Although  several  reviews  on  SPG  exist,  most  focus  solely  on 
college or professional school samples (e.g., Topping, 1998) or
combine  K-12  with  college  studies  in  the  synthesis  (e.g.,  van 
Zundert,  Sluijsmans,  &  van  Merrienboer,  2010).  However,  as 
discussed  previously,  K-12  and  college  classrooms  should  be 
considered  separately  because  of  differences  in  students’  cognitive 
development  and  learning  environments. 

Meta-Analyses  on  SPG  in  College 

Despite  the  importance  of  focusing  separately  on  K-12  studies, 
two  very  similar  meta-analyses  on  SPG  in  the  college  setting 
warrant mention. Falchikov and Boud (1989) aggregated 57 studies
that investigated self-assessment compared with instructor assessment.
Their results suggested that college students give themselves
higher grades than their instructors (d = .47) and that
student grades demonstrated a moderate relationship with teacher
grades (r = .39). The quality of the study, course level (i.e.,
introductory vs. advanced), and subject matter influenced the
correspondence between self-grading and teacher-grading. Falchikov
and Goldfinch (2000) conducted another meta-analysis
with 48 studies on the effects of peer-grading and concluded that
students give their peers higher grades than instructors (d = .24)
and showed a strong intercorrelation among peer-graders (r =
.69). Design quality, use of rubrics, and the nature of the assessment
task (i.e., academic or professional task) moderated the
effects  of  SPG. 


Narrative  Reviews  on  Self-Grading  in 
K-12  Classrooms 

Four reviews have narratively examined SPG in the K-12 classroom.
Ross (2006) reviewed research evidence on student self-grading,
focused largely on the K-12 setting. Evidence on the
alignment of self-grading with teacher-grading was weak, with few
studies (k = 2) directly examining the relationship between self,
peer,  and  teacher  grades  for  a  specific  outcome  measure  (that  is, 
mean  grades,  degree  of  variation  in  grades,  and/or  correlations 
between  student  and  teacher  grades  related  to  the  same  test). 
Self-grading  was  found  to  generally  improve  student  performance 
on  subsequent  assessments  (14  of  16  studies  reported  positive 
effects). 

Brown  and  Harris  (2013)  synthesized  studies  examining  the 
effects  of  self-grading  practices  in  kindergarten  through  12th 
grade.  However,  their  review  is  notable  in  that  they  broadly 
defined  self-grading  by  including  both  studies  on  more  general 
self-ratings  (e.g.,  van  Kraayenoord  &  Paris,  1997)  and  “self-rated 
confidence  in  accuracy  of  work”  (e.g.,  Koivula,  Hassmen,  &  Hunt, 
2001). The median effect size (ES) fell between .40 and .45
(range = −.04 to 1.62; no overall ES was reported), with weak to
strong correlations between student and teacher distributions of
grades (r range = .2 to .8). However, the summary statistics did
not distinguish between studies that reported self-grading, self-rating,
and self-estimates of performance from those that reported
the  effect  of  self-grading  on  subsequent  test  scores.  Brown  and 
Harris  (2013)  concluded  that  self-grading  can  improve  learning 
outcomes  when  students  are  engaged  in  self-regulation  processes 
(e.g., self-monitoring against objective standards) and when teachers
are actively engaged in the development and monitoring of
self-grading. Increasing age (and related school experience) appeared
to improve the correspondence between students and
teachers. 

Two meta-analyses have investigated the impact of self-assessment
on subsequent writing performance in primary schools.
In  2011,  the  Carnegie  Corporation  examined  the  influence  of 
formative  writing  assessment  to  improve  writing  achievement 
(Graham,  Harris,  &  Hebert,  2011).  Based  on  seven  reports,  the 
authors  found  that  when  students  are  taught  how  to  self-grade  their 
own work, scores improved by .46 standard deviations. No moderator
analyses were conducted for this study. The authors concluded
that self-assessment is an evidence-based practice for improving
the writing of American students (Graham et al., 2011).

Another  meta-analysis  investigated  the  effects  of  self-grading  of 
writing  assignments  on  subsequent  writing  performance  (Graham, 
Hebert, & Harris, 2015). The average weighted ES of self-grading
from a total of 11 reports was .62 standard deviations, which
indicated a significant impact of self-grading on subsequent performance.
A metaregression showed that the quality of the study,
feedback  structure,  or  grade  level  moderated  the  effect.  Both 
meta-analyses  also  reported  on  the  effect  of  peer  feedback,  but 
their  results  were  not  specific  to  peer-grading,  so  studies  involving 
students  providing  feedback  without  a  grade  were  also  included. 
Taken together, these reports indicate that the provision of self-feedback,
including self-grading, has a positive influence on students’
writing achievement.

Peer-Grading in K-12 Classrooms

Topping  (2013)  summarized  the  research  on  peer-assessment  in 
elementary and secondary schools. He noted that previous literature
reviews failed to operationalize peer-grading. A large proportion
of studies used survey methodologies, and very few used
quasi-experimental  designs  (two  of  16).  Notably,  the  review  also 
pointed  to  the  need  for  more  rigorous  methodology  in  primary 
studies;  no  study  conducted  in  elementary  school  implemented  an 
experimental  design,  and  only  one  study  in  secondary  schools  used 
a  comparison  group  with  a  posttest-only  design. 

Finally,  Sebba  et  al.  (2008)  conducted  a  synthesis  of  research 
evidence  on  the  impact  of  students’  SPG  in  secondary  schools. 
Most studies occurred in the United States (62%), and only 11 of
26  (42%)  involved  comparison  groups.  Sixty  percent  of  studies 
(nine  of  the  15)  measuring  achievement  reported  higher  scores 
after  use  of  SPG.  Among  other  suggestions  that  were  consistent 
with Ross’s (2006) recommendations, Sebba et al. (2008) suggested
that teachers need instruction in SPG in both initial training
and  continuing  professional  development. 

Taken  together,  narrative  research  syntheses  and  meta-analyses 
on SPG point to the paucity of high-quality empirical studies
in K-12 classrooms. Furthermore, most studies focused on the
long-term effects of SPG, thereby adhering to the formative assessment
view of SPG. Few reviews investigated the degree of
grade  similarity  and/or  correspondence  between  student  and 
teacher  grades,  suggesting  that  less  attention  is  paid  to  SPG  as 
summative  assessment  in  primary  and  secondary  classrooms. 

These  reviews  pointed  to  some  important  moderators  in  SPG. 
Although  self-grading  and  peer-grading  are  discussed  collectively 
in  this  manuscript  as  “student  grading,”  they  are  distinct  entities 
that  warrant  separate  examination.  Past  reviews  have  pointed  to 
experimental  design,  classroom  characteristics  (i.e.,  students’  year 
in school, course subject), student training, and rubric use as
potentially relevant moderators that might improve SPG outcomes.

The  Present  Research  Synthesis 

The  present  research  synthesis  fills  a  void  in  the  SPG  literature 
in relation to the effects of SPG implementation and the correspondence
between SPG and teacher grades in primary and secondary
classrooms. Our meta-analysis extends previous reviews by
updating  the  evidence  base,  providing  cumulative  statistics,  and 
more  formally  testing  for  moderators  of  SPG  effects.  The  primary 
purpose  of  this  article  is  to  examine  whether  and  when  SPG  are 
effective  techniques  in  primary  and  secondary  school  classrooms. 
This includes examining the long-term achievement-related consequences
of SPG. It also includes an examination of how closely
students’ grades correspond to teacher grades with regard to both
the  mean  grades  given  and  their  correlation  (the  placement  of  a 
particular  student’s  grade  in  the  grade  distribution). 

In sum, in order to determine the effects of self- and peer-grading
in the classroom, we conducted a series of meta-analyses
to  examine  the  following  research  questions:  (a)  What  are  the 
effects  of  self-grading  and  peer-grading  practices  on  subsequent 
test  performance?;  (b)  What  is  the  mean  difference  between  SPG 
and  teacher  grades  when  grading  the  same  test?;  and  (c)  What  is 
the  degree  of  distributional  correspondence  between  student  and 
teacher graders?¹


In  addition  to  our  main  research  questions,  we  also  investigated 
variables  that  might  moderate  these  relationships.  We  based  our 
choice of moderators on the claims of scholars who have previously
written on SPG and have summarized the research literature.
Specifically,  we  hypothesized  that  the  effect  of  SPG  would  be 
greater 

•  in  secondary  than  in  primary  school  grades; 

•  in  STEM  classes  rather  than  non-STEM  classes; 

•  with  students’  use  of  rubrics  rather  than  no  definition  of 
criteria  by  which  to  score  students’  work; 

•  with  students’  training  on  how  to  self-  or  peer-grade 
compared  with  no  training; 

•  with longer training exposures compared with short training
exposure;  and

•  with multiple modes of training rather than either examples
or training.

For the first research question examining the long-term implication
of SPG, a typical study would have the experimental group
complete an assignment, then perform self- or peer-grading to
determine a score for the assignment. The control group would either
not be given any assignment at all or have their assignments
graded by the teacher. After the SPG had occurred in the experimental
group (on one or multiple occasions), all students would be
given a different posttest that was scored by the teacher or experimenter.
The scores from the posttest served as the outcome
measure  and  would  be  compared  between  the  experimental  and 
control  groups  to  determine  the  effects  of  SPG  on  subsequent  tests. 
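
The posttest comparison described above reduces to a standardized mean difference between the two groups. The following is a minimal Python sketch, assuming that the g reported in this synthesis denotes Hedges' g computed from group means, standard deviations, and sample sizes. The function name and the summary statistics are hypothetical and are not taken from any included study.

```python
import math


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference with a small-sample correction."""
    # Pooled variance across the two independent groups.
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    d = (mean_t - mean_c) / math.sqrt(pooled_var)
    # Common approximation of the small-sample correction factor.
    j = 1 - 3 / (4 * (n_t + n_c) - 9)
    return j * d


# Hypothetical posttest summaries: SPG group versus comparison group.
g = hedges_g(mean_t=78.0, sd_t=10.0, n_t=30, mean_c=74.0, sd_c=10.0, n_c=30)
print(round(g, 3))
```

A positive value indicates that the group that graded its own or peers' work scored higher on the teacher-scored posttest than the comparison group.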

For  the  second  and  third  research  questions  examining  the 
relationship  between  teacher  and  student  grading,  the  same  test 
would be scored by the teacher and by either the student themselves
or a peer. The scores given by the teacher or self/peer would
then  be  compared  with  each  other  to  determine  degree  of  average 
grade  similarity  and  the  correlation  of  grades  with  one  another.  In 
these  studies,  the  teacher-graded  test  and  student-graded  test  would 
be  considered  to  be  yoked  outcomes  of  the  control  condition  and 
experimental  condition,  respectively. 
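
The two comparisons described above (average grade similarity and correlation) can be illustrated on a single set of tests graded twice. This is a minimal sketch with hypothetical grades. It treats the two sets of grades as independent groups when pooling the standard deviation, which is a simplifying assumption and not necessarily how the yoked outcomes were handled in this synthesis.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of grades."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardized_mean_difference(student, teacher):
    """Student-minus-teacher mean difference in pooled-SD units."""
    n_s, n_t = len(student), len(teacher)
    pooled_var = (
        (n_s - 1) * stdev(student) ** 2 + (n_t - 1) * stdev(teacher) ** 2
    ) / (n_s + n_t - 2)
    return (mean(student) - mean(teacher)) / math.sqrt(pooled_var)


# Hypothetical grades for the same five tests.
teacher_grades = [70, 75, 80, 85, 90]
student_grades = [72, 74, 83, 84, 92]

r = pearson_r(student_grades, teacher_grades)
d = standardized_mean_difference(student_grades, teacher_grades)
print(round(d, 3), round(r, 3))
```

The mean difference captures whether students grade more leniently or harshly than the teacher on average, whereas the correlation captures whether students and teachers rank the same work similarly. The two can diverge: uniformly inflated student grades would yield a large mean difference but a high correlation.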

With  increased  emphasis  on  students’  active  involvement  in  the 
learning  process,  a  clear  understanding  of  SPG  in  the  elementary 
and  secondary  schools  is  needed.  We  hope  to  contribute  to  this 
understanding. 

Method 

Criteria  for  Including  Studies 

A  study  had  to  meet  several  criteria  to  be  included  in  the 
research  synthesis.  First,  the  study  had  to  focus  on  one  or  more  of 
the  following  research  themes:  (a)  the  influence  of  prior  use  of 
students  as  graders  on  these  students’  subsequent  test  performance, 
(b)  mean  score  differences  when  assigned  by  different  graders  (i.e., 
teacher,  peer,  or  self),  and/or  (c)  the  correlation  between  scores 
assigned  by  teachers,  peers,  or  self.  For  the  first  research  theme, 
reports had to assign some participants to take part in student
grading (treatment), whereas others did not participate in student
grading (comparison). All students were later tested and graded by
a teacher or experimenter. For the second and third research
themes, reports had to have both teachers and students grade the
same student test and compare either teacher- versus peer-graders
or teacher- versus self-graders. All included studies had to use
either random assignment of students to conditions or some form of
quasi-experimental design. Studies in which students served as
their own controls (i.e., pretest-posttest) were not included.

¹ Of note, our discussion of student grading purposefully avoided the
term “accuracy,” as it implies that teacher grades are a perfect reflection of
student performance. Instead, we refer to “mean difference,” “correspondence,”
and “correlation” when we make our comparisons.

For the purposes of these meta-analyses, we defined the construct
of SPG as using criterion-referenced feedback, that is, when
students  assigned  numerical  values  (or  letter  grades  subsequently 
converted  to  numerical  values)  to  either  their  own  or  a  peer’s  work 
in  an  attempt  to  make  an  objective  judgment  about  the  quality  of 
the  work.  In  this  sense,  the  grading  criterion  had  to  include  more 
than  written  comment  or  feedback  or  general  reflection.  It  had  to 
include  a  numerical  value  or  grade.  Accordingly,  SPG  requires  the 
students to assess the task directly and systematically use criteria
or  standards  that  are  task-specific. 

We  employed  three  additional  exclusion  criteria  before  coding 
began. First, we excluded studies that did not provide quantitative
marking on a specific learning outcome variable. For example, in
some excluded studies, students graded their own general competence
in a subject area (e.g., Ikeguchi, 1996), students provided
perceptions on how well they or their peers performed (e.g.,
Wright & Houck, 1995), students edited the writing process (and
these  were  tallied)  as  opposed  to  providing  a  measure  of  a  specific 
learning  outcome  variable  (e.g.,  Fitzgerald  &  Markham,  1987; 
Paquette,  2008),  or  students  ranked  themselves  compared  with 
peers (e.g., Crocker & Cheeseman, 1988). Second, we excluded
reports  that  did  not  have  a  comparison  group  (e.g.,  Andrade,  Du, 
&  Wang,  2008).  For  the  first  hypothesis,  this  meant  that  the  study 
had  to  have  a  comparison  group  that  did  not  partake  in  self-  or 
peer-grading.  For  our  second  and  third  hypotheses,  the  same  test 
had  to  be  graded  twice,  once  by  the  teacher  and  once  by  the 
student; the former served as the comparison. Lastly, we excluded
reports that did not provide enough information to calculate
an  ES  (e.g.,  Beach,  1979;  Bickmore,  1981).  For  reports  with 
unclear  methods  or  missing  information  to  calculate  an  ES,  the 
study’s  authors  were  contacted  for  additional  information.  We 
limited  the  contact  to  authors  who  had  recently  published  (i.e., 
since  2005;  n  =  3),  and  two  authors  responded,  which  allowed  us 
to  include  their  studies  in  the  synthesis. 

Literature  Search  Procedures 

Our  initial  database  searches  sought  to  identify  any  studies 
related  to  effective  grading  strategies  in  general.  To  do  so,  we  first 
searched  the  ERIC  and  PsycINFO  electronic  reference  databases 
for published and unpublished documents related to grading strategies.
The two databases were chosen because they were most
likely  to  contain  reports  related  to  education  and  developmental 
differences  that  might  affect  instructional  practices.  The  searches 
were  conducted  during  February  2016  and  were  not  restricted  by 
date of report dissemination. The subject (SU) term “grades (scholastic)”
was paired separately in intersection with the following SU
terms: “evaluation methods,” “evaluation criteria,” “test methods,”
“measurement technique,” “peer evaluation,” “self evaluation,”
“multiple choice tests,” “peer grading,” “self grading,” “peer assessment,”
and “self assessment.” After the initial search, a second
search  was  performed  using  the  same  databases  and  subject  terms 
with  the  key  term  “grading  (educational)”  in  intersection  with  the 
terms  above.  Searches  were  conducted  sequentially,  with  overlap¬ 
ping  documents  excluded  from  each  subsequent  search. 

Three  coders  (two  research  assistants  and  a  postdoctoral  fellow) 
were  trained  to  examine  each  report’s  title,  abstract,  and  keywords 
from  the  search  results.  Each  researcher  worked  independently  to 
categorize  each  document  as  to  whether  (a)  it  was  irrelevant,  that 
is,  mentioned  grading  of  tests  not  at  all  or  only  in  passing;  (b) 
contained  background  information  on  grading  strategies  but  was 
not  an  empirical  study;  or  (c)  included  empirical  data  on  the 
research  question  of  interest. 

Within  the  final  grouping  of  studies,  studies  conducted  using 
students  in  Grades  K-12  were  separated  from  studies  using  college 
students. Studies of grading strategies compared different techniques
for determining or assigning grades. Typically, they involved
the experimental manipulation and comparison of more
than one grading strategy. Pretest-posttest designs and case studies
of a particular strategy were also included in this category. Excluded
documents containing empirical data about grading were
those that reported on the grading practices of teachers (without
any comparison with student graders), grading systems for program
evaluation, or a comparison between online and on-site
classes. If at least two coders agreed on a document’s placement,
it was placed into the agreed-upon category. If the disagreement
could not be resolved, the principal investigator was consulted. If
relevant reports were misclassified during the initial coding process,
two other techniques (e.g., the examination of reference and
citation lists in relevant reports, contact with active researchers;
see below for more detail) would help to uncover the reports again.
In  total,  1,459  abstracts  were  examined  and,  of  these,  323  (22%) 
were  deemed  to  fit  Criteria  (b)  or  (c)  within  the  K-12  domain  by 
at  least  two  researchers. 

We  then  obtained  the  323  potentially  relevant  documents,  94  of 
which  contained  empirical  data.  These  reports  were  examined  in 
their entirety. The relevant reports were further grouped as belonging
to specific grading strategies. This categorization suggested
that the questions “Does student grading affect later achievement?”
and “Do self-assigned and peer-assigned mean grades
differ  from  teacher  grading?”  had  received  most  of  the  research 
attention  in  the  K-12  grading  literature.  Specifically,  we  identified 
38  empirical  studies  investigating  SPG  in  the  K-12  literature  (other 
studies  addressed  education-related  themes,  such  as  effect  of  type 
of  test  response,  effect  of  grade  cutoffs).  We  subsequently  used 
these 38 studies as the basis to find other relevant studies. However,
some of these initial studies were later excluded (see exclusion
criteria below).

Additional  Search  Strategies 

Three  additional  strategies  were  employed  to  ensure  that  we 
identified  potentially  relevant  reports  that  may  not  have  been 
identified  with  prior  searches  in  the  reference  databases:  direct 
contact  with  researchers,  backward  searches,  and  forward 
searches.  We  contacted  educational  researchers  to  learn  about 
undiscovered  projects  that  were  relevant  but  difficult  to  find,  such 
as  very  recent  research  and  unpublished  reports.  Specifically,  we 
contacted  researchers  who  had  written  relevant  articles  in  the  past 
10 years (n = 4). Three of the four researchers responded but did
not  provide  additional  reports.  We  also  sent  an  e-mail  to  the 
American Educational Research Association Classroom Assessment
Special Interest Group listserv to request that members share
any research that related to SPG in K-12 classrooms. No replies
were  received. 

Next,  we  examined  the  reference  lists  of  all  reports  that  met 
inclusion  criteria  to  determine  whether  they  cited  any  potentially 
relevant  reports  (backward  search).  We  then  conducted  a  cited 
reference search to determine whether the reports that met inclusion
criteria had been later cited by any potentially relevant reports
(forward  search).  We  reviewed  these  articles  for  relevance.  For 
backward  and  forward  searches,  titles  and  abstracts  were  initially 
reviewed  by  the  first  author,  and  if  deemed  potentially  relevant,  the 
full  text  was  obtained.  Forward  and  backward  searches  were 
conducted  on  any  empirical  study  report  and/or  literature  review 
that  was  determined  to  be  relevant.  These  two  search  strategies 
yielded  29  potential  articles.  Finally,  we  also  received  documents 
(n = 3) from colleagues in our laboratory who were conducting
searches on different but related educational research topics. The
first author examined the full text of 113 studies; 53 did not pass
initial  screening  for  possible  relevance  according  to  the  exclusion 
criteria  listed  above,  and  27  studies  required  a  more  thorough 
inspection,  but  were  ultimately  excluded  for  the  reasons  listed  in 
supplemental  file  B  of  the  online  supplemental  materials.  This 
resulted  in  a  total  of  33  articles  eligible  for  the  meta-analyses. 

Procedure  for  Synthesis  of  Studies 

Information retrieved from studies. Numerous characteristics
of each study were retrieved from reports and entered into a
database. These characteristics encompassed six broad distinctions
among studies: (a) the research report included basic information
about the authorship and date of report appearance; (b) study
characteristics included information about the setting, cultural
context, and design features of the study; (c) sample information
detailed the demographic characteristics of the different samples of
students; (d) grader comparison information included general
grading instructions, rubric use, and grader training; (e) outcome
measures included details pertaining to the test format and subject;
and (f) estimate of ES detailed the information needed to calculate
an ES for the relationship between grading outcomes (e.g., n, M,
SD, and/or correlations). As is true in all meta-analyses, many of
the study characteristics we coded either were not reported often
enough or exhibited too little variability across studies to be
examined through moderator analyses. Variables that were examined
in the meta-analysis will be presented along with the overall
results.

Effect size estimation. To answer the first and second research
questions, we used the standardized mean difference, or
Hedges’ g (Hedges & Olkin, 1985), to estimate the effect of the
student grader on student outcomes (i.e., subsequent learning or
mean grade given) reported by each study. In most cases, we first
calculated ESs (Cohen’s d; Cohen, 1988) from means and standard
deviations using the ES calculator provided by Wilson (2001). We
then input the ES and group ns into a statistical program (Comprehensive
Meta-Analysis; Borenstein, Hedges, Higgins, & Rothstein,
2014) to calculate Hedges’ g. If means and standard deviations
were not available, we indirectly retrieved the information
needed to calculate d-indexes using inferential statistics (Borenstein
et al., 2014; Wilson, 2001). Several reports presented separate
means and standard deviations for multiple subsections of one
test. In these cases, ESs were calculated for each domain.

To compare the effects of student grading on subsequent test
performance, we subtracted the mean grade given by the teacher-grader
group (comparison) from the mean grade given by the
student-grader group (treatment) and then divided the difference
by their weighted average standard deviation. In this case,
positive g-indexes indicated that the experience of student grading
increased students’ performance on later tests. To compare student
versus teacher grades, we subtracted the mean grade given by
instructors from the mean grade given by self- or peer-graders and
divided the difference by their weighted average standard deviation. Thus, positive
g-indexes indicated that grades given by peers or the self were
higher than grades given by instructors.
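
The computation described above can be sketched in a few lines. This is a minimal illustration, not the routine used by the ES calculator or by Comprehensive Meta-Analysis: the function name `hedges_g`, the example numbers, and the use of the common approximation 1 - 3/(4df - 1) for the small-sample correction are all assumptions made for the sketch.

```python
import math


def hedges_g(mean_treat, sd_treat, n_treat, mean_comp, sd_comp, n_comp):
    """Standardized mean difference (treatment minus comparison), bias-corrected."""
    df = n_treat + n_comp - 2
    # Weighted average (pooled) variance of the two groups.
    pooled_var = ((n_treat - 1) * sd_treat ** 2 + (n_comp - 1) * sd_comp ** 2) / df
    d = (mean_treat - mean_comp) / math.sqrt(pooled_var)  # Cohen's d
    # Common approximation to Hedges' small-sample correction factor (assumption).
    correction = 1 - 3 / (4 * df - 1)
    return correction * d


# Hypothetical numbers, not taken from any included study.
g_example = hedges_g(78.0, 10.0, 30, 74.0, 10.0, 30)
g_reversed = hedges_g(74.0, 10.0, 30, 78.0, 10.0, 30)
print(round(g_example, 3), round(g_reversed, 3))
```

With the treatment group entered first, a positive value means the student-grader group scored (or graded) higher than the comparison, which matches the sign convention stated above.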

To  address  the  third  research  question,  for  those  reports  con¬ 
tributing  correlations  between  students  and  teachers,  correlation 
coefficients were coded exactly as reported. Thus, positive correlations
indicated agreement between student and expert graders
and  larger  positive  correlations  indicated  stronger  agreement. 

Coder reliability. Each research report was coded independently
by two coders (a research assistant and a postdoctoral
fellow). If there was a discrepancy in coding, the two coders
discussed each disagreement until agreement was reached. If the
disagreement could not be resolved, the principal investigator was
consulted. Because all studies were independently coded twice and
disagreements  were  resolved  by  a  third  independent  coder,  the 
effective  reliability  of  codes  is  very  high  (Rosenthal,  1987)  and  an 
estimate  of  reliability  (which  would  involve  two  new  coders  and 
an  independent  disagreement  resolver)  is  not  called  for  (APA 
Publications  and  Communications  Working  Group  on  Quantitative 
Research  Reporting  Standards,  2016). 

Identification  of  statistical  outliers.  First,  we  examined  the 
distribution  of  ESs,  for  both  g-indexes  and  r  values,  to  determine 
whether  any  were  statistical  outliers.  The  Grubbs  (1950)  test,  also 
called  “the  maximum  normed  residual  test”  (also  see  Barnett  & 
Lewis,  1994),  identifies  outliers  in  univariate  distributions  one 
observation  at  a  time.  If  outliers  were  identified  (using  p  <  .05, 
two-tailed,  as  the  significance  level),  these  values  were  set  at  the 
value  of  their  next  nearest  neighbor.  Separate  tests  were  conducted 
for  those  reports  contributing  g-indexes  and  correlations  for  the 
separate  research  questions.  This  same  procedure  was  also  applied 
to the distribution of sample sizes.
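
A rough sketch of this outlier rule follows. It computes the Grubbs statistic for the most extreme value and, if it exceeds a critical value, sets that value to its nearest neighbor. The function names are hypothetical, and the critical value is passed in by the caller rather than derived; the 1.887 used in the example is an assumed tabled value for n = 6 at p < .05, two-tailed, and should be checked against a published Grubbs table.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, index) for the observation farthest from the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / sd, idx


def pull_in_outlier(values, critical_value):
    """Set the most extreme value to its nearest neighbor if G exceeds the cutoff."""
    g_stat, idx = grubbs_statistic(values)
    if g_stat <= critical_value:
        return list(values)
    others = [v for i, v in enumerate(values) if i != idx]
    nearest = min(others, key=lambda v: abs(v - values[idx]))
    adjusted = list(values)
    adjusted[idx] = nearest
    return adjusted


# Hypothetical effect sizes; 1.887 is an assumed critical value (see lead-in).
effect_sizes = [0.1, 0.2, 0.15, 0.3, 0.25, 3.0]
adjusted = pull_in_outlier(effect_sizes, critical_value=1.887)
unchanged = pull_in_outlier([0.1, 0.2, 0.3], critical_value=10.0)
print(adjusted)
```

Because the test handles one observation at a time, a caller would repeat the procedure on the adjusted list until no further value exceeds the cutoff.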

Publication  bias.  Despite  the  use  of  several  complementary 
search  techniques,  the  possibility  always  remains  that  we  were 
unable  to  obtain  all  studies  that  have  investigated  our  research 
questions.  Therefore,  we  used  the  Duval  and  Tweedie  (2000a, 
2000b)  trim-and-fill  procedure  to  test  whether  the  distribution  of 
ESs  used  in  the  analyses  was  consistent  with  variation  in  ESs  that 
would  be  predicted  if  the  estimates  were  normally  distributed.  For 
example, a skewed distribution might indicate a possible publication
bias created either by the study retrieval procedures or by data
censoring  on  the  part  of  authors.  The  trim-and-fill  procedure 
provides  a  way  to  estimate  the  values  from  missing  studies  that 
need  to  be  present  to  approximate  a  normal  distribution.  Often, 
these  missing  values  indicate  nonsignificant  results  that  are  less 
likely  to  make  their  way  into  obtainable  reports. 


SELF-  &  PEER-GRADING  IN  3RD-12TH  GRADE  CLASSROOMS 




Independent  hypothesis  tests.  To  avoid  a  potential  biasing 
effect  of  multiple  ESs  per  study,  we  conducted  random  effects 
meta-analyses,  using  robust  variance  estimation  with  small  sample 
correction  (Hedges,  Tipton,  &  Johnson,  2010;  Tanner-Smith  & 
Tipton,  2014;  Tipton,  2015).  The  robust  variance  estimator  (RVE) 
addresses  the  problem  of  correlated  group  pairs  by  mathematically 
adjusting the standard errors of the ESs to account for the dependence,
and the small sample correction maintains appropriate Type
I error rates. An intraclass correlation was specified (ρ = .8) to
estimate the ES weights. The RVE method has one important
limitation: Tanner-Smith, Tipton, and Polanin (2016) assert that
when df < 4 (or the number of studies < 5 for a single predictor
analysis), use of the RVE method is not suggested because of the
unreliability of the t-distribution and underestimation of the true
Type I error. In these cases, summary statistics were computed
with Comprehensive Meta-Analysis software without controlling
for study clustering, which is less likely to be influential with a
small number of studies. Heterogeneity was assessed using τ², the
between-studies variance component, and the I² statistic, which is
the percentage of the total variability attributable to variation in
ESs (and not sampling of participants into studies). Higher I²
values denote higher proportions of the observed variation to be
real (Borenstein et al., 2014). Approximate guidelines for interpreting
I² values have been established at 25%, 50%, and 75% for
low, medium, and large heterogeneity, respectively (Higgins &
Thompson, 2002; Higgins, Thompson, Deeks, & Altman, 2003).
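
To make the heterogeneity statistics concrete, the sketch below computes Q, τ², I², and a random-effects mean with the conventional DerSimonian-Laird estimator. This is a swapped-in, simpler technique for illustration only: it treats every ES as independent and is not the robust variance estimation procedure described above. The function names and the toy inputs are assumptions.

```python
def i_squared(q, df):
    """Percentage of total variability not attributable to sampling error."""
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0


def random_effects_summary(effects, variances):
    """DerSimonian-Laird random-effects summary for independent effect sizes."""
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)  # between-studies variance component
    w_re = [1 / (v + tau2) for v in variances]  # random-effects weights
    mean = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = (1 / sum(w_re)) ** 0.5
    return {"mean": mean, "se": se, "q": q, "tau2": tau2, "i2": i_squared(q, df)}


# Toy data: two hypothetical studies with equal sampling variance.
summary = random_effects_summary([0.0, 1.0], [0.1, 0.1])
print(summary)
```

As a check on the definition, the heterogeneity test reported in Footnote 2, Q(23) = 127.59, gives `i_squared(127.59, 23)` of about 81.97, the I² value reported there.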

We first examined several study characteristics that past researchers
and other scholars had suggested might be associated with SPG
study outcomes as well as other important characteristics of research
design. For all three meta-analyses, we examined six potential moderators
that addressed differences in the classroom characteristics and
SPG  procedures  (rubric  use  and  training).  Studies  were  grouped 
according  to  the  (a)  grade  level  of  the  students  (Grades  3-8  or  Grades 
9-12),  (b)  class  subject  (science,  technology,  engineering,  and  math 
[STEM]  or  not-STEM),  (c)  rubric  use,  (d)  whether  or  not  students 
received  training  (no  training  vs.  some  training),  (e)  the  length  of 
training  (less  than  six  or  more  than/equal  to  seven  exposures),  and  (f) 
mode  of  training  (either  practice  or  examples  or  multiple  modes 
including  both  examples  and  practice). 

In  our  analyses,  each  ES  associated  with  a  study  was  first  coded 
as  if  it  were  an  independent  estimate  of  the  relationship.  Thus,  we 
report number of reports (k) and number of ESs (which might be
more  numerous)  in  our  results. 

Software. The Comprehensive Meta-Analysis (CMA, Version
3.3.070; Borenstein et al., 2014) software package was used to
calculate  the  within-study  variance  for  each  study,  to  examine 
publication  bias,  to  compute  Hedges’  g,  and  to  calculate  summary 
statistics  when  the  number  of  studies  was  less  than  five.  Robust 
variance  estimation  was  conducted  using  R  Package  (R  Core 
Team,  2016),  with  syntax  provided  by  Tanner-Smith  et  al.  (2016) 
when  the  number  of  studies  to  be  included  was  more  than  five. 

Results 

Our search strategies coupled with the inclusion/exclusion criteria
identified 33 reports that represented the retrievable literature
on SPG in kindergarten through 12th grade. These reports answered
three unique, but related, questions on SPG. Reports were
grouped into the “SPG as formative assessments” category when the
researchers investigated the long-term consequences of SPG with (a)
a group who participated in student grading, and (b) a group who
did not (self-grading, k = 20; peer-grading, k = 7). Reports that
compared mean grades given by teachers with those given by
student-graders on the same test were grouped into the “SPG as
summative assessments” category (k = 9). Reports that investigated the
correspondence of scores within the distribution of grades given
by student-graders and teachers were included as summative assessment
but analyzed separately (k = 7). One report tested all
three questions (Sadler & Good, 2006) and another report tested
the long-term effects of both self- and peer-grading (Tseng & Tsai, 2007). These
results were entered into each group of studies to which they were
relevant. With multiple related outcomes appearing in some studies,
a total of 86 usable g-indexes and 13 correlations were retrieved.
Of note, no reports studied children younger than third
grade, so our results are bound by this lower limit of generalizability.
Also, no statistical outliers of sample size, Hedges’ g, or
Pearson’s r were detected.

SPG  as  Formative  Assessment 

This meta-analysis included reports that addressed the research
question, “What are the effects of using student grading on students’
subsequent performance?” A majority of the reports (k =
20) studied the effects of self-grading on subsequent test performance
(see Table 1), whereas a minority (k = 7) reported on
peer-grading (see Table 3). A few reports (k = 2) contributed ESs
for both self-grading and peer-grading separately; their information
was included in both sets of data. Most of the studies (84%)
were  carried  out  in  the  United  States  or  Canada.  Sample  sizes 
ranged  from  18  to  667  students. 

Effect of self-grading on subsequent academic performance.
Table 1 summarizes the reports that examined the effect of self-grading
compared with groups that had their tests graded as usual.
Some reports (k = 14) provided pretest scores for the self-grading
and teacher-grading groups. We calculated Hedges’ g for the
pretest scores and subtracted it from the posttest g-index to compute
an adjusted g-index for 30 ESs. For the six reports that did not
provide a pretest score, the unadjusted g-index was used. The
average report contributed more than one ES (M = 2.20, SD =
1.88, minimum = 1, maximum = 9). Of the 44 ESs, 32 were
positive (i.e., students in the self-grading condition performed
better on a subsequent test than students who had not self-graded)
and 12 were negative (i.e., self-grading students performed worse
than comparison students). Effect sizes ranged from −0.82 to
1.75 (see Table 1). The trim-and-fill procedure (Duval & Tweedie,
2000a, 2000b), used to test for data censoring, estimated no missing
ESs from the distribution.
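
The pretest adjustment and the ESs-per-report summary above are simple enough to reproduce. The sketch below is illustrative only: `adjusted_g` is a hypothetical helper name, its example inputs are invented, and the list of ES counts is transcribed from the "number of effect sizes" column of Table 1, one entry per report in table order.

```python
import statistics


def adjusted_g(posttest_g, pretest_g):
    """Pretest-adjusted g-index: posttest g minus pretest g."""
    return posttest_g - pretest_g


# Hypothetical values: a 0.50 posttest difference with a 0.15 pretest head start.
example_adjusted = adjusted_g(0.50, 0.15)

# Number of ESs contributed by each of the 20 self-grading reports (Table 1).
es_per_report = [2, 2, 1, 1, 1, 4, 4, 2, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 3]

total_es = sum(es_per_report)
mean_es = statistics.mean(es_per_report)
sd_es = statistics.stdev(es_per_report)  # sample standard deviation
print(total_es, round(mean_es, 2), round(sd_es, 2))
```

Subtracting the pretest g-index removes any preexisting difference between the groups from the posttest comparison, which matters most for the designs without random assignment.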

Using robust standard errors to account for within-study clustering
based on a random effects error model, the average weighted g-index
was .34, 95% confidence interval (CI) [.15, .52], τ² = .09, I² =
87.01 (see Footnote 2). These results suggest that, on average, self-grading in the


2 We also used the independent sample as the unit of analysis and
averaged across effect sizes within independent samples. The average
weighted g-index was .33, 95% CI [.19, .46], for a random effects model.
The test for heterogeneity of effect sizes was significant, Q(23) = 127.59,
p < .001, τ² = .08, I² = 81.97, indicating that the variability in effect sizes
was greater than that which would be expected because of sampling error
alone.
Table 1

Studies Investigating the Effect of Self-Grading Skills on Subsequent Performance

                                Pub.  No. of                                                         Student   Exposure to  Hedges'
Study name                      type  ESs     n    Country       Grade    Test subject               training  training     g
------------------------------  ----  ------  ---  ------------  -------  -------------------------  --------  -----------  ---------
Andrade and Boulay (2003)       J     2       119  U.S.          7 & 8    Language arts              P         2            .04, .1
Fontana and Fernandes (1994)a   J     2       667  Portugal      3 & 4    Math                       E & P     6+           .15, .48
Guastello (2001)a               CP    1       167  U.S.          4        Language arts              E & P     6+           1.26
Horn (2009)a                    D     1       38   U.S.          3        Language arts              E & P     1            -.21
Irwin (1973)a                   D     1       266  U.S.          9-12     Mechanical drawing         ng        6+           .27
Maqsud and Pillai (1991)a       J     4       68   South Africa  9-12     Science                    none      0            .11, .17, .32, .44
McDonald and Boud (2003)        J     4       515  Barbados      11       Humanities                 E & P     6+           .49, .52, .27, .49
Olina and Sullivan (2004)       J     2       170  Latvia        10 & 11  Psychology                 E & P     2            .14, .27
Poplin (2009)a                  D     1       128  U.S.          11       History                    none      6+           .25
Ramdass and Zimmerman (2008)    J     2       42   U.S.          5 & 6    Math                       P         1            1.08, .35
Ross et al. (1998)a             CP    2       306  Canada        5 & 6    Math                       E & P     6+           .03, -.03
Ross et al. (1999)a             J     1       296  Canada        4-6      Language arts              E & P     6+           .18
Ross et al. (2001)a             CP    9       37   Canada        11       Math                       ng        1            -.23, -.82, -.58, -.54, -.37, -.06, -.27, -.15, .10
Ross et al. (2002)a             J     1       492  Canada        5 & 6    Math                       E & P     6+           .38
Ross and Starling (2008)        J     3       143  Canada        9        Geography                  E & P     6+           .30, .82, .38
Sadler and Good (2006)a         J     1       46   U.S.          7        Science                    E & P     6+           .84
Schunk (1996)a                  J     2       44   U.S.          4        Math                       ng        6            .27, 1.24
Wall (1982)a                    J     1       44   U.S.          4        History, Spanish, reading  none      3            .00
Warner et al. (2012)            CP    1       50   U.S.          7        Math                       E & P     ng           .09
Wolter (1975)a                  D     3       18   U.S.          6        Language arts              P         2            .70, 1.75, 1.64

Note. D = dissertation/master's thesis; J = journal article; CP = conference proceedings; E = training through examples; P = training through
practice.

a Pretest and posttest scores were available and an adjusted g-index is reported in the table. Adjusted g-indexes were computed by subtracting the pretest
g-index from the posttest g-index.


classroom  improved  students’  subsequent  performance  by  about  one 
third  of  a  standard  deviation  compared  with  the  performance  of 
students  who  had  not  previously  self-graded. 

Quality. Variations in study designs are an especially important
characteristic to examine when the research question involves
testing a causal connection, but the research context leads to
studies with less-than-ideal designs. Therefore, we first examined
whether differences in study design characteristics led to differences
in results. The dimensions associated with making strong
causal inferences are listed for each study in Table 2. Regrettably,
the reports contained too little variation within each variable to do
a formal statistical analysis. However, we could combine three of
the design variables to group studies into three categories allowing
different strengths of causal inference. Four reports employed
random assignment, which is a principal indicator of a study’s
ability to draw strong causal inferences. Collectively, their average
weighted g-index was 1.00, 95% CI [.64, 1.36], τ² = .08, I² =
42.71. There were also four studies that had nonequivalent control


Table  2 

Quality  Indicators  of  Self-Grading  Studies 


groups that equated the groups prior to the experimental manipulation
(based on a variety of variables, e.g., grade, achievement).
Their average weighted g-index was .23, 95% CI [.12, .35], τ² =
.00, I² = 11.96. The rest of the studies had designs with nonequivalent
control groups without a priori equating. These 12 studies
had an average weighted g-index of .21, 95% CI [.02,
.40], τ² = .08, I² = 87.63. Taken together, these results suggest
that reports with a stronger experimental design showed larger
long-term effects of self-grading, though a formal statistical test of
this finding awaits future research.

There  are  also  conclusions  that  can  be  drawn  about  weaknesses 
in  the  studies’  designs  as  a  collection.  Most  notably,  all  but  four  of 
the  studies  administered  the  treatment  at  the  level  of  the  classroom 
but  analyzed  the  data  using  the  student  as  the  unit.  Such  analyses 
do  not  take  intraclass  dependencies  into  consideration.  This  is  a 
failing  of  the  studies  (but  one  that,  regrettably,  occurs  in  many 
areas  of  classroom  research). 

Table C1 in supplemental file C of the online supplemental
materials presents study characteristics that might influence the
validity of conclusions after the data have been collected. Again, the
studies did not reveal enough variation in these characteristics to
allow credible statistical analyses of their influence on outcomes.
These characteristics reveal that, as a group, study outcomes were likely not
influenced by their level of overall or differential attrition. Also,
floor and ceiling effects on pretests (when they were used) and
posttests are generally not an area of concern. Three studies
suggested students “volunteered.” We suspect that the “no mention”
studies also used such samples of convenience. Thus, although
combining results across studies enhances the heterogeneity
of included classrooms, the issue of how these samples of
convenience (especially the use of volunteers) might differ from
all classrooms remains unanswered. It also highlights the need for
researchers to present more complete descriptions of their sampling
procedures.

Moderator  analyses.  We  conducted  analyses  exploring  five 
moderators,  grouped  according  to  classroom  characteristics  and 
SPG  procedures  (i.e.,  use  of  rubrics  and  training),  of  the  effects  of 
self-grading on subsequent test performance. To aid in interpretation,
we performed an analysis to determine whether any relationship
existed between the six moderator variables. Only one such
correlation is worth mentioning; perhaps not surprisingly, studies
with multiple modes of training showed a significant positive
relationship with length of training exposure (r = .76, p = .000).
This  correlation  analysis  suggests  that  the  training  variables  were 
highly  correlated  with  each  other. 

Classroom  characteristics.  Thirteen  studies  used  students  in 
elementary  or  middle  school  as  subjects  and  seven  studies  used 
high  school  students.  A  moderator  analysis  revealed  no  significant 
difference  in  ESs  for  studies  with  younger  compared  with  older 
students, t(13.5) = 1.11, p = .29. Class subjects were grouped as
STEM (k = 9) or other subjects, which
included language arts (k = 5), mechanical drawing (k = 1),
humanities (k = 3), and psychology (k = 1). A moderator analysis
showed no significant difference in self-grading outcomes for STEM
classes compared with other subjects, t(17) = 0.08, p = .67.

Training.  Three  variables  captured  differences  in  student 
training.  First,  studies  were  coded  on  whether  they  trained  the 
students  at  all  (“yes”  or  “no”;  training  presence).  Only  three 
studies provided no training to their students; most studies (k = 17)


gave some training. Thus, this moderator was not analyzed. Then,
studies  were  coded  on  whether  they  used  multiple  modes  of 
training  (training  type;  i.e.,  use  of  both  practice  and  examples).  Six 
studies trained through either practice or examples and 11 used
both modes. A moderator analysis showed no significant difference
between receiving one type of training and receiving multiple types of
training, t(9.14) = 0.001, p = .98.

Lastly, studies were grouped according to the frequency with
which students self-graded (grading exposures). Studies whose
students self-graded fewer than six times were considered to have
received short-term exposure (k = 9), and studies whose students
self-graded on seven or more occasions were considered to have
received long-term exposure (k = 10). One study did not report the
number of grading exposures. The experience of self-grading fewer
than six times appeared to be a natural cutoff. Specifically, studies
that described a short-term intervention typically enumerated the
number of occasions of self-grading, whereas studies implementing
self-grading practices over the course of the study/semester
typically did not give an exact number; rather, these studies
made self-grading an integral part of instruction. A moderator
analysis did not show any effect of frequency of self-grading
exposure on subsequent tests, t(15.11) = 0.85, p = .41.

Rubric  use.  Of  the  20  reports,  19  (95%)  explicitly  indicated 
that  rubrics  were  used;  Schunk  (1996)  made  no  mention  of  rubric 
use.  Eleven  reports  described  the  use  of  specific  rubrics  to  aid  the 
students  in  the  self-grading  process,  that  is,  a  rubric  that  provided 
a  clear  set  of  criteria  and  described  varying  levels  of  quality  for  a 
specific  assignment  (“specific”  rubric;  Andrade  &  Valtcheva, 
2009). Two studies used general rubrics in the grading process
(that is, rubrics that provide some criteria reflecting the underlying
skills and knowledge within the defined domain but ultimately
leave the grader to make an overall judgment on the quality of the
work), and several did not describe the rubric used (k = 6). Only
seven reports indicated that students helped create the rubrics; most
of these reports (86%) originated from the same research group
(Ross and colleagues).

Effect of peer-grading on subsequent performance. Table 3
summarizes seven reports that contributed 11 ESs for the analysis
of the difference on subsequent tests between students who graded
their peers and those who did not. Most of the reports used
nonequivalent control groups without equating as the method of
assignment (k = 6). Of note, the one study that employed random
assignment to condition demonstrated the largest effect of peer-grading
(g = .69). Generally speaking, the peer-grading studies
had no serious issues with attrition or measurement ceilings and
floors (see supplemental file C, Table C2). Many of the studies of
peer-grading were carried out in elementary and middle schools
(k = 4) and in a language arts course (k = 6). In regard to training,
all but one study (Farrell, 1977) trained
their students to peer-grade, and most (k = 5) provided students
with many opportunities (i.e., more than 10 peer-grading instances)
to grade their peers. All of the studies reported that they
had used rubrics to guide the student grading, with a majority
indicating the use of specific rubrics (four of seven reports, 57%;
use of general rubrics, k = 1; information not given, k = 2). Only
one report indicated that the students aided in rubric development
(Sadler & Good, 2006).

Table 3

Studies Investigating the Effect of Peer-Grading Skills on Subsequent Performance

Study name                Publication  Number of ESs   Type of design       Pre-/posttest  n     Grade  Test subject   Student   Exposure to  Adjusted
                          type         for each study                       design                                     training  training     g index^a
Califano (1987)           D            4               Q without equating   Yes            41    5      Language arts  E         18           .02
                                                                                           48    6      Language arts  E         18           [illegible]; .28
Farrell (1977)            D            2               Q without equating   Yes            91    11     Language arts  None      12           -.33; .32
                                                                                                                                 12           .27
Horn (2009)               D            1               Q without equating   Yes            37    3      Language arts  E&P       1            .45
Karegianes et al. (1980)  J            1               Q without equating   No             49    10     Language arts  P         10           .41
Pierson (1966)            D            1               Q without equating   Yes            153   9      Language arts  E         13           .15
Sadler and Good (2006)    J            1               Q without equating   Yes            73    7      Science        E&P       10           .22
Wise (1992)               D            1               T                    Yes            134   8      Language arts  P         1            .69

Note. All students were described as being in mixed achievement classrooms, with the exception of Karegianes et al. (1980), which reported on students
who were below-grade achievers. All retrieved studies were conducted in the United States. Studies were listed more than once when more than one
independent sample was reported. D = dissertation/master’s thesis; J = journal article; Q = quasi-experiment; T = true experiment; E = training through
examples; P = training through practice.
^a Adjusted g-index was computed by subtracting the pretest g-index from the posttest g-index.

A total of seven reports with adjusted means were included in
the analysis. The average report contributed more than one ES
(M = 1.57, SD = 2.19, minimum = 1, maximum = 4), and ESs
ranged from -.33 to .69. Using RVE with random effects
assumptions, the average weighted g-index of adjusted means was
.29, 95% CI [.08, .50], τ² = .03, I² = 39.21. This analysis
suggested that peer-grading shows a positive effect on subsequent
test performance.
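As a rough illustration of how a random-effects average, τ², and I² of the kind reported above can be obtained, the following Python sketch pools one effect size per report with the DerSimonian-Laird estimator. This is not the RVE procedure used in this synthesis; it is a simpler stand-in that treats every ES as independent. The function name `random_effects_pool`, the choice of the five single-ES reports from Table 3, and the 4/n variance approximation are all assumptions made for illustration, so the output should not be expected to reproduce the estimates reported here.

```python
import math

def random_effects_pool(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator.

    Illustrative stand-in only: every effect size is treated as independent,
    which the RVE approach used in the synthesis does not assume.
    """
    k = len(effects)
    if k < 2:
        raise ValueError("need at least two effect sizes to estimate tau^2")
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0
    i2 = 100.0 * max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    return {"g": pooled, "ci": ci, "tau2": tau2, "i2": i2}

# Adjusted g and n for the five single-ES reports in Table 3
# (Horn, Karegianes et al., Pierson, Sadler and Good, Wise).
effects = [0.45, 0.41, 0.15, 0.22, 0.69]
sample_sizes = [37, 49, 153, 73, 134]
# Assumption: sampling variance approximated as 4/n (two equal-sized groups).
variances = [4.0 / n for n in sample_sizes]

result = random_effects_pool(effects, variances)
print(result)
```

Because the random-effects weights add τ² to every sampling variance, large and small studies are weighted more evenly than under a fixed-effect model, which is why heterogeneity widens the confidence interval.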

The  trim-and-fill  procedure  (Duval  &  Tweedie,  2000a,  2000b), 
used  to  test  for  data  censoring,  estimated  two  missing  ESs  smaller 
than  the  observed  mean  and  no  evidence  of  missing  ESs  larger 
than  the  overall  mean.  This  procedure  estimated  with  random 
effects  error  modeling  (in  CMA)  that  the  mean  would  decrease  by 
.068. Thus, the analysis suggested that the observed average
weighted g-estimate might be higher than expected had the data not
been censored in some way.

In summary, studies demonstrated that both self- and
peer-grading positively affected subsequent achievement performance,
with self-grading showing a larger positive long-term effect. In
regard to the effects of self-grading exposure, studies that
implemented random assignment appeared to show larger effects than
studies that did not have equivalent control groups. We found little
variation in student training and rubric use; nearly all students
were trained and used rubrics. We found no studies that
looked at student grading prior to third grade.

SPG  as  Summative  Assessment 

These meta-analyses included reports that answered the research
questions “How do grades assigned by students compare with
grades assigned by teachers on the same outcome measure?” and
“What is the degree of similarity (correlation) of a student’s
position on a grade distribution when grades are assigned by
students and by teachers?” (see Table 4). Analyses were conducted
separately for each research question. The literature search
identified reports (k = 9) with 31 ESs that compared the means of
students’ and teachers’ assigned grades. Researchers reported ESs
for self-grading (k = 4), peer-grading (k = 2), or both (k = 3). The
average report contributed more than one ES (M = 3.44, SD =
2.40, minimum = 1, maximum = 12). Of the 31 ESs, 13 were
positive and 18 were negative. A positive ES indicated that the
student gave higher marks than the teacher, and a negative ES
indicated that the student gave lower marks than the teacher.
Sample sizes ranged from five to 184 students. A few reports (k =
3) contributed ESs for both questions and thus were included in
both sets of data. Table 4 summarizes the findings that examined
mean differences between students and teachers in the grades they
assigned, with self and peer g-indexes appearing in Columns 11
and 12, respectively.
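The sign convention described above (positive when students give higher marks than teachers) can be made concrete with a small Python sketch of a standardized mean difference with Hedges' small-sample correction. The function name `hedges_g`, the independent-groups pooled standard deviation, and the example numbers are assumptions for illustration; the source reports do not state which variant of the formula each ES was derived from, and grades given to the same tests by students and teachers are arguably paired rather than independent.

```python
import math

def hedges_g(mean_student, mean_teacher, sd_student, sd_teacher,
             n_student, n_teacher):
    """Standardized mean difference (student minus teacher) with the
    usual approximate small-sample correction J = 1 - 3 / (4 * df - 1).

    Assumes two independent groups; this is a simplification.
    """
    df = n_student + n_teacher - 2
    pooled_sd = math.sqrt(
        ((n_student - 1) * sd_student ** 2 + (n_teacher - 1) * sd_teacher ** 2) / df
    )
    d = (mean_student - mean_teacher) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return d * correction

# Hypothetical grades on a 100-point test, 30 tests graded by each source.
g_lenient = hedges_g(85.0, 80.0, 10.0, 10.0, 30, 30)  # students grade higher
g_strict = hedges_g(75.0, 80.0, 10.0, 10.0, 30, 30)   # students grade lower
print(g_lenient, g_strict)
```

With this convention, grade inflation by students would show up as a positive g and stricter student grading as a negative g.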

Collectively,  the  studies  were  performed  in  either  music  classes 
( k  =  4)  or  STEM-related  classes  ( k  =  5).  Notably,  all  the  reports 
originated  from  the  United  States  or  Taiwan.  Furthermore,  no 
studies  from  the  United  States  occurred  in  high  school,  whereas 
most  of  the  reports  from  Taiwan  occurred  in  high  school.  Only  one 
study  reported  that  the  test  counted  toward  the  final  grade  (Sadler 
&  Good,  2006). 

Quality. Weaknesses (and variation) in study designs for
making causal inferences are less of an issue in research that (a)
compares grades assigned by students and teachers, and (b)
calculates the correlation between these assigned grades. This is
because the stimulus for grading (the students’ tests) and the
context in which the grading takes place (the school, classroom,
subject matter, etc.) are yoked for each teacher-student pair in
the study. Only the grader differs between conditions. Although
it might be feasible to study “peer” grading by creating
experimental stimuli that vary in controlled ways (and it is not clear
that such a strategy would lead to more plausibly causal
inferences than allowing stimuli to vary naturally), such designs
could not be used to study self-grading. Further, except for the
fact that few of the studies mentioned whether students were
lost from the original target population, the issues of attrition
and measurement ceilings appear to be inconsequential for
these studies (see supplemental file C, table C3).

Table 4

Studies Investigating the Difference Between Student- and Teacher-Graded Tests

Study name              Pub   No. of   Type of design      Country  n           Grade           Student  Test subject      Student   g index for              g index for              r
                        type  ESs                                                               type                       training  self-grading             peer-grading
Aitchison (1995)        D     1        Q without equating  U.S.     84          7, 8            M        Music             None      1.53                     —                        .42
Chang et al. (2012)     J     2        Q without equating  Taiwan   72          9, 10, 11, 12   M        Computer          E&P       .21                      .44                      .83 (self)/.28 (peer)
Davis (1981)            D     4        Q without equating  U.S.     12; 5       5; 6            M        Music             E&P       -.24; -.41; -.16; -.47   —                        .57; ng; ng; ng
Kruse (2006)            UW    1        Q with equating     U.S.     18          6               M        Music             E         .13                      —                        .63
Lin et al. (2002)       J     1        Q without equating  Taiwan   57          10              M        Engineering       None      —                        -.17                     .63
Sadler and Good (2006)  J     5        T                   U.S.     49; 24; 24  7               M        Science           E         —; —; —; .12             -.18; -.37; -.18; -.36   .91; ng; ng; .98 (self)
Sung et al. (2005)      J     2        Q without equating  Taiwan   37          9               M        Computer Science  None      -.46; -.28               —                        ng; ng
Sung et al. (2010)      J     18       Q without equating  Taiwan   29          7               H        Music             E         -.52                     -1.51                    .41 (self)
                                                                    32          8               H        Music             E         -.37                     -1.18                    .52 (self)
                                                                    60          7               M        Music             E         -.38                     -1.02                    ng
                                                                    48          8               M        Music             E         .07                      -.68                     ng
                                                                    27          7               L        Music             E         .47                      .69                      ng
                                                                    30          8               L        Music             E         .34                      .20                      ng
Tseng and Tsai (2007)   J     3        Q without equating  Taiwan   184         10              M        Computer          None      —                        .64; .34; .21            .71; .56; .57

Note. Rubrics were given to students in all studies to assist with grading. Studies were listed more than once when more than one independent sample
was reported. D = dissertation/master’s thesis; J = journal article; UW = unpublished work; Q = quasi-experiment; T = true experiment; M = mixed
levels of student achievement; L = low achievement levels; H = high achievement levels; E = training through examples; P = training through practice;
ng = not given.

One study (Sadler & Good, 2006) employed random assignment
to conditions regarding whether or not the students self-graded or
peer-graded. As in all other studies, the stimuli (tests) were still
yoked for students and teachers. This study reported one ES for
self-grading (g = .12) and four ESs for peer-grading (average
weighted g-index = -.28).

Rubric  use.  Of  the  nine  reports,  all  indicated  that  rubrics  were 
used.  Most  of  the  reports  (k  =  6)  described  the  use  of  general 
rubrics  to  aid  the  students  in  the  grading  process,  whereas  a  small 
proportion  of  reports  used  specific  rubrics  ( k  =  3).  Students  were 
not  often  included  in  rubric  creation  (k  =  6),  with  a  few  reports 
indicating  student  involvement  in  rubric  creation  ( k  =  3). 

Student training. Notably, four of the nine studies did not train
their  students  to  grade.  Some  of  the  studies  used  examples  only  to 
train  the  students  to  grade  ( k  =  3),  and  others  (k  =  2)  used  both 
examples  and  practice. 

Differences in mean grades assigned by self and teacher.
For the analysis examining differences between self- and
teacher-grading, seven reports contributed to the summary statistic. Using
RVE with random effects assumptions, the average weighted
g-index of means was .17, 95% CI [-.41, .76], τ² = .30, I² =
89.82. These results suggest that, on average, primary and
secondary school students assigned themselves grades that are not
significantly different from the grades that teachers assigned when
grading the same outcome.

The trim-and-fill procedure (Duval & Tweedie, 2000a, 2000b),
used to test for data censoring, estimated two missing ESs larger
than the overall mean and no evidence of missing ESs smaller than
the overall mean. Specifically, the analysis (performed in CMA)
with random effects error estimated that the average would
increase by .22 if the missing studies were included. Thus, this
analysis suggested that the estimate is lower than might have been
found without publication bias or another type of data censoring
(e.g., selective reporting of results by authors).

Differences in mean grades assigned by peers and teacher.
A total of five reports contributed to the summary statistic
examining the difference between grades given by peers and by teachers
on the same test. Using RVE with random effects assumptions, the
average weighted g-index of means was -.04, 95% CI [-.60, .52],
τ² = .60, I² = 96.62. These results suggest that, on average,
primary and secondary school students assigned grades to their
peers that are not significantly different from the grades that
teachers assigned when grading the same outcome.

The trim-and-fill procedure (Duval & Tweedie, 2000a, 2000b),
used to test for data censoring, estimated with random effects
modeling (in CMA) no missing ESs smaller or larger than the
overall mean. Thus, this analysis suggested that the estimate
approximates what might have been found without publication bias
or another type of data censoring (e.g., selective reporting of
results by authors).

Distribution similarity. A total of eight reports
contained 13 correlations between student-assigned and
teacher-assigned grades. Sample sizes ranged from 17 to 184
students. Correlations between student graders and teacher graders
ranged between r = .42 and .71. A total of six reports provided
correlations between self and teacher grades, and four reports
provided correlations between peer and teacher grades (two reports
contributed correlations for both self-analysis and peer-analysis).

The  trim-and-fill  procedure  (Duval  &  Tweedie,  2000a,  2000b) 
estimated  no  missing  ESs  less  than  the  overall  mean  and  two  ESs 
more  than  the  overall  mean.  An  estimate  of  the  adjusted  weighted 
overall  mean  difference,  including  the  identified  missing  values, 
would  increase  the  correlation  estimate. 

Using RVE with random effects assumptions, the average
weighted r value for the correspondence between self-grading and
teacher-grading (based on seven reports) was .67, 95% CI [.41,
.93], τ² = .05, I² = 95.29. The trim-and-fill procedure (Duval &
Tweedie, 2000a, 2000b) estimated no missing ESs less than the
overall mean and two ESs more than the overall mean. The
procedure, with random effects modeling in CMA, suggested that
the correlation estimate would increase by .086. This procedure
used for the detection of publication bias indicated that the average
weighted r value is less than expected.

Based on four reports, the average weighted r value for the
correlation between grades given by peers and grades given by
teachers was .68, 95% CI [.32, .87], τ² = .24, I² = 97.24. The
trim-and-fill procedure (Duval & Tweedie, 2000a, 2000b)
estimated no missing ESs less than the overall mean and one ES more
than the overall mean. The procedure, with random effects
modeling in CMA, suggested that the correlation estimate would
increase by .084, thereby indicating that the average weighted r
value is less than expected. Taken together, these results suggest
that, on average, primary and secondary school students assigned
themselves and their peers grades that corresponded well with the
grades that instructors assigned when grading the same outcome, at
least with regard to where particular students were placed in the
distribution of all students.
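One common way to average correlations and attach a confidence interval is to work on Fisher's z scale and transform back. The Python sketch below shows that approach in its simplest fixed-effect form. It is offered only as an illustration: the source does not state whether correlations were transformed before averaging, the function name `fisher_z_mean` is hypothetical, and the example correlations and sample sizes are invented rather than taken from Table 4.

```python
import math

def fisher_z_mean(correlations, sample_sizes):
    """Average correlations on Fisher's z scale, weighting each by n - 3,
    and back-transform the mean and its 95% CI to the r scale.

    Fixed-effect sketch; it does not model between-study heterogeneity.
    """
    zs = [math.atanh(r) for r in correlations]   # r -> z
    weights = [n - 3 for n in sample_sizes]      # 1 / var(z) = n - 3
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))
    lower, upper = z_bar - 1.96 * se, z_bar + 1.96 * se
    return math.tanh(z_bar), (math.tanh(lower), math.tanh(upper))

# Hypothetical student-teacher correlations from three classrooms.
r_values = [0.55, 0.70, 0.62]
n_values = [40, 60, 120]
mean_r, (ci_low, ci_high) = fisher_z_mean(r_values, n_values)
print(mean_r, ci_low, ci_high)
```

Because the back-transformed interval is bounded by 1, it is typically asymmetric around the mean r, with the upper arm shorter than the lower arm when r is large.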

In summary, reports investigating SPG as summative
assessment are sparse. Students in fifth to 12th grades evaluated students’
tests similarly to teachers; both grades assigned and the
distribution of grades showed a similarity to teacher grades. However,
inferences are limited given the incomplete representation of
cultures and students from primary and secondary schools.

Discussion 

This research synthesized the literature examining self-grading
and peer-grading in the third through 12th grade levels using
criterion-referenced testing. The survey of previous scholarly writing
and the meta-analyses contribute to the current body of literature
by examining self-graders and peer-graders separately using
criterion-referenced feedback and restricting studies to those that
implemented an experimental or quasi-experimental design. This
exercise helped disentangle numerous issues not separated in
previous review efforts and provided formal tests for a (regrettably)
few moderators that scholars and educators have posited could
mediate the SPG process. It also uncovered important questions
that have gone unanswered.

Our  findings  revealed  that  most  of  the  literature  surrounding 
SPG  in  primary  and  secondary  classrooms  examined  SPG  as 
formative  assessment,  that  is,  to  help  students  do  better  on  future 
tests.  As  expected,  the  practice  of  self-grading  in  the  classroom 
showed  a  nontrivial  effect  on  students’  subsequent  grades. 

One important issue to note is the paucity of reports
investigating SPG in kindergarten through second grades. Although our
meta-analyses sought to include reports containing this young
group of children, no studies were found that examined any of our
questions at these grade levels. Also, we did not encounter any
study that investigated peer-grading in high school. Thus, we have
to limit our generalizations to grading practices in the third to 12th
grades for the long-term consequences of SPG, and the fifth to
12th grades for SPG as summative assessment. But these
omissions may not be random. It would not be surprising if teachers feel
that children 8 years old or younger have not developed the
metacognitive skills needed to be graders. They may also lack the
emotional security to take criticism from peers (as might
adolescents). For peer-grading, high school teachers may feel that the
importance of grades for college admissions and the heightened
role of social comparison among adolescents make peer-grading
problematic in high school. Whether these conjectures on our part
hold would be a fruitful avenue for future research.

What  Are  the  Effects  of  Student  Grading  on 
Subsequent  Test  Performance? 

Our clearest and arguably most important result is that
self-grading increased academic performance on subsequent tests by
about one third of a standard deviation, suggesting that active
engagement in the grading process results in beneficial effects for
student learning. These findings concur with and provide empirical
evidence for Ross’s (2006) findings from a more general review of
the literature.

It is important to recognize that the self-grading condition was
compared with a teacher-graded or no-grading condition (an
as-usual condition). Teachers who are prepared to provide students
with instruction on grading must have a clear sense of what is
important, and their instruction is likely to be reasonably
well-aligned with their grading criteria. In this sense, teachers in the
SPG condition are more likely to have clearer lesson objectives
than those who are not.

Studies that controlled for differences between groups through
random assignment showed larger long-term effects
of self-grading. In addition, this finding is consistent with other
similar meta-analyses in SPG research at the college level
(Falchikov & Boud, 1989; Falchikov & Goldfinch, 2000), which
suggested effects of methodological quality (based on various
experimental-level variables in addition to completeness in
reporting of information) on the reported outcomes of SPG research.

Given that training of students to self-grade is often espoused as
an essential ingredient of successful self-grading, it is surprising
that our indicators of training (presence, length, and type) did not
moderate the effects of self-grading on student achievement.
However, all of the studies provided students with rubrics and most
provided training, which suggests a high degree of student
supervision and structure present throughout the studies. This
scaffolding of instruction to students may have diminished our ability to
find effects of training because of low variability between studies.
It does point out that training is viewed as a critical element to the
success of SPG and suggests that future studies that experimentally
manipulate the presence, amount, and type of training would be
highly informative.

Similar to the self-grading experience, the peer-grading
experience also benefited subsequent test performance (even when we
used a robust procedure for estimating effects). However, the small
number of studies contained in the peer-grading literature limits
the generalizability of the findings to the U.S. population.
Although the current results contributed additional studies on
peer-grading and limited studies to ones that used criterion-referenced
assessments, our findings are similar to those of Topping’s (2013) review,
which suggested that although the situation is slowly changing, more
studies are needed that investigate peer-grading in primary and
secondary classrooms.

Importantly, the effect of peer-grading was smaller compared
with the effect of self-grading on subsequent test performance.
This finding suggests that self-grading may affect metacognitive
skills (e.g., reflection and internalization) more than peer-grading
among third through 12th graders. Self-grading relies on a
student’s metacognitive competencies by drawing on
self-observation, self-judgment, task analysis, self-control, and so forth
(Brown & Hands, 2013). Self-grading may improve student
academic performance by teaching students to use and rehearse
metacognitive skills. Self-regulation is related to academic
achievement; students who are capable of setting goals, making flexible
plans to meet them, and monitoring their progress are more likely
to perform better in school than students who are not (Andrade &
Valtcheva, 2009), as evidenced by improvement in self-grading
skills over time (Butler & Lee, 2010). Furthermore, self-regulation
skills (setting goals, deliberating about strategies, managing
motivation) are argued to be the most useful skills for students to be
effective learners (Butler & Winne, 1995).

With regard to the theoretical rationales for SPG writ large, the
existing studies did not directly measure mediating variables
suggested by the relevant theories. For example, based on theoretical
explanations of SPG’s effect on learning, the students exposed to
SPG should show increases in their sense of autonomy,
self-monitoring, and sense of fairness in grading. These have never
been measured in SPG studies, but the results were consistent with
the theoretical predictions. It would be both interesting and
informative if future studies of SPG included measures of these
variables.

Do  Mean  Grades  Differ  When  Students  or  Teachers 
Are  the  Graders? 

The meta-analysis investigating student and teacher mean grade
comparisons found little difference between grades assigned by
students and teachers. Self-grading means were in the predicted
direction (g = .17), but not significantly so, perhaps related to the
lack of power to test this effect. Peer-assigned grades hardly
differed at all from teacher grades (g = -.04). This meta-analysis
perhaps serves to somewhat allay teachers’ conception that
students are unable to grade themselves and others without grade
inflation; instead, it appears that students in primary and secondary
classrooms give scores similar in mean to scores given by teachers.

Meta-analyses investigating the same research questions at the
college level showed that students graded tests between one fourth
and one half of a standard deviation higher than teachers (Falchikov
& Boud, 1989; Falchikov & Goldfinch, 2000). Thus, it appears
that the difference between students and teachers emerges as
grades become more consequential. First, these findings may stem
from more supervision and structure in the grading process for
younger students, as all reports in the primary and secondary
school meta-analysis provided rubrics for the student to grade.
Second, more competition exists as students move through school
and grades have longer term and more direct implications for
continuing education, which might lead to a greater pressure to
grade oneself with more leniency and one’s peers with more rivalry
(Sebba et al., 2008). Third, students in earlier grades may be more
inclined to explicitly follow teachers’ instructions. College
students are typically given more autonomy in the academic
atmosphere, whereas primary and secondary school students typically
experience more direct guidance from their teachers. Lastly, tests
at higher grade levels contain more complexity, which might
require a greater level of inference and/or metacognitive abilities
(Hovardas et al., 2014). Future research might directly investigate
the differences in SPG implementation as students move through
levels of schooling. Taken together, our synthesis confirms that
developmental and contextual differences exist in the
implementation of SPG in the college versus the primary and secondary
school settings.

Importantly,  most  of  the  reports  did  not  have  the  test  count 
toward  the  final  grade,  thereby  making  the  student-grading  process 
a  low-stake  activity  in  the  classroom.  Also,  no  studies  from  the 
United  States  occurred  in  high  school,  whereas  most  of  the  reports 
from  Taiwan  occurred  in  high  school. 

What  Is  the  Correlation  Between  Student  Graders 
and  Teachers? 

The grades assigned by middle and high school students
demonstrated a moderate relationship with those assigned by teachers
in regard to the placement of students within grade distributions
(r = .67). This finding suggests that about 45% of the variance in
grades was shared by students and teachers. Notably, the
correlations were all in the positive direction, which suggests generally
good correspondence between students and teachers. It also
suggests that future research should examine what explains the
remaining variance in both student- and teacher-graded scores.
Although a stronger emphasis on effort reflected in grades given by
students is often proposed (Stipek, 1981; Stipek & Tannatt, 1984),
we know of no study that directly investigates this supposition or
whether its influence diminishes across development. Taken
together, moderate correlations and good mean correspondence
between student and teacher grades indicate that SPG can be
successful as a summative assessment in primary and secondary
classrooms, though the weight given to test grades in determining
a student’s overall grades is yet to be tested.
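The shared-variance figure quoted above follows directly from squaring the average correlation. As a brief worked step (the symbol r_ST for the student-teacher correlation is notation introduced here, not the authors'):

```latex
r_{ST}^{2} = (.67)^{2} = .4489 \approx .45
\qquad\Longrightarrow\qquad
1 - r_{ST}^{2} \approx .55
```

That is, roughly 45% of the variance in grades is common to student and teacher graders, and roughly 55% remains to be explained by other factors.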

SPG  as  Formative  or  Summative  Assessment 

More studies investigated the effect of SPG as formative
assessment, thereby giving students the opportunity of revision and
improvement over the long term. Studies of SPG in primary and
secondary classrooms rarely implemented SPG as summative
assessment, which involves students grading their own or others’
work for a final grade. This imbalance of literature is perhaps
reflective of the general movement to emphasize formative
assessment and reflects the degree of difficulty in using students as the
judge for a final grade, especially when students are very young or
preparing to apply for college (when grades count the most). This
finding may also be because of the pervasive practice of
teacher-controlled summative results in the kindergarten through 12th
grade classrooms. Understandably, it appears to be difficult for
precollege teachers to report grades judged by students and/or their
peers to stakeholders and parents because of concerns about how
reliable and similar these may be to teacher judgments. However,
the current meta-analysis suggested that fifth through 12th graders
showed relatively similar grading outcomes compared with
teachers.

Limitations 

Our syntheses are confined by the typical limitations implicit in
the nature of meta-analyses. Specifically, our analysis was limited
by the level of completeness and specificity of reporting found in
the studies that we identified through our literature search. Issues
of statistical power restricted our investigation of many
moderators, such that numbers were small or unequally distributed
between moderator variable groupings.

Also,  our  methods  would  have  been  strengthened  if  we  were 
able  to  use  intraclass  coefficients  to  control  for  clustering  effects 
within  classes.  However,  the  studies  reported  within  these  meta¬ 
analyses  did  not  report  intraclass  correlations,  and  estimated  values 
were  not  available  that  would  have  approximated  our  studies  based 
on  similar  sampling  strategies,  populations,  and  outcome  measures 
(Hedges  &  Rhoads,  2011). 
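
To illustrate why the missing intraclass correlations matter, the following minimal sketch applies the standard design-effect approximation for equal-sized clusters, DEFF = 1 + (m - 1) * rho, where m is the class size and rho is the intraclass correlation. The class size, sample size, and intraclass correlation used here are hypothetical values chosen only for illustration; they are not taken from the studies in these meta-analyses.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_students: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_students / design_effect(cluster_size, icc)


# Hypothetical example: 200 students in classes of 25, assumed ICC of 0.20.
deff = design_effect(25, 0.20)
n_eff = effective_sample_size(200, 25, 0.20)
```

Under these assumed values the design effect is about 5.8, so 200 clustered students would carry roughly the information of 34 independent ones. Ignoring clustering in such a case would overstate the precision of an effect size.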

Future  Research 

A  more  complete  picture  of  SPG  is  needed  in  primary  and 
secondary  schools.  In  addition  to  the  research  directions  mentioned 
above,  future  research  might  investigate  the  development  of  SPG 
skills  in  kindergarten  through  the  third  grade.  We  understand, 
however,  that  SPG  at  these  grades  may  raise  concerns  about  the 
students’  emotional  reactions  and  its  impact  on  self-concept,  es¬ 
pecially  among  poorer  performers.  Additionally,  research  has  sug¬ 
gested  that  younger  children’s  ability  to  engage  in  SPG  may  be 
compromised  because  of  their  strong  emphasis  on  effort  when 
grading  (Nicholls,  Patashnick,  &  Mettetal,  1986;  Stipek  &  Tannatt, 
1984).  Thus,  we  recommend  SPG  in  young  children  with  great 
caution. 

Notably, many of the indicators suggested that many of the studies
included in the meta-analyses had design weaknesses. For example,
experimental and control groups were often drawn from different
(nonequivalent) schools, and the unit of assignment rarely was the
unit of statistical analysis. Although the occasional use of random
assignment and the consistency of findings are encouraging, these
weaknesses suggest that school-based research looking at SPG
needs more rigorous tests of effectiveness.

In  addition,  these  meta-analyses  were  not  able  to  address  how 
student-level  variables  might  affect  the  degree  to  which  SPG  is 
beneficial.  For  example,  more  studies  are  needed  that  investigate 
how  a  student’s  level  of  ability  influences  the  effectiveness  of 
SPG.  Furthermore,  studies  that  address  the  influence  of  students’ 
grade  level  and  class  subject  matter  on  SPG  outcomes  are  critical 
to  establish  a  more  comprehensive  understanding  of  SPG  and 
achievement. 

Conclusion 

Our  meta-analyses  do  provide  fresh  data  and  insights  indicating 
the  following: 


•  Primary  and  secondary  students  demonstrate  enhanced 
learning  in  the  future  when  they  have  previously  self-  or 
peer-graded.  These  results  suggest  that  when  students  par¬ 
take  in  SPG,  they  may  develop  clearer  retention  and/or 
understanding  of  the  assessed  material. 

•  Students  can  self-grade  and  peer-grade  relatively  similarly 
to  teachers.  When  using  SPG  as  formative  evaluation,  the 
process  of  SPG  is  important  for  the  learning  experience 
and  the  actual  importance  of  the  grade  is  diminished. 

Thus,  SPG  could  effectively  be  implemented  more  frequently  in 
the  third  through  12th  grade  classrooms.  Self-grading  as  formative 
assessment  appears  to  be  the  most  favorable  way  to  practice  SPG 
in  the  classroom.  Additionally,  it  appears  that  teachers  (and  re¬ 
searchers)  understand  that  they  must  give  students  training  and 
support  in  order  for  students  to  benefit  from  the  procedures. 
Training  of  students  represents  a  significant  initial  time  commit¬ 
ment  on  the  part  of  the  teacher;  however,  any  new  curriculum 
endeavor typically requires a similar initial effort.

Four Semesters Investigating Frequency of Testing, the Testing Effect, and
Transfer of Training

Donald  J.  Foss  and  Joseph  W.  Pirozzolo 

University  of  Houston 

We carried out 4 semester-long studies of student performance in a college research methods course (total N =
588). Two sections of it were taught each semester with systematic and controlled differences between them.
Key manipulations were repeated (with some variation) across the 4 terms, allowing assessment of replicability
of effects. Variables studied included frequency of tests (e.g., 2 vs. 8 in-class exams), the repetition of
some and not other exam items (i.e., the testing effect), and variation of test items between the in-class exams
and the final exam (e.g., identical items vs. controlled changes in items). Some studies also manipulated
presence or absence of low-stakes quizzes. The repetition of test items generally led to better performance.
However, we did not observe consistent superiority for items that were repeated exactly over those that were
repeated in modified form; the reverse was more often the case. The effect of the low-stakes quizzes was
minimal at best. Results are discussed in terms of memory and transfer of training models.


Educational  Impact  and  Implications  Statement 

Can  we  find  inexpensive  and  easily  adaptable  modifications  to  teaching  methods  that  positively  impact 
student  outcomes?  These  studies  provide  a  positive  answer  to  that  question.  The  work  is  based  on 
laboratory  findings  that  frequent  tests  and  frequent  attempts  to  recall  the  same  material  (1)  aid  learning  and 
memory, and (2) help students apply what they’ve learned to new problems. The present studies took
place  in  large-enrollment  college  classes  across  four  semesters.  Within  each  semester  two  sections  of  an 
undergraduate  course  were  taught  in  a  highly  similar  fashion,  primarily  differing  in  the  number  of  tests 
given  and  whether  items  that  appeared  on  an  earlier  test  were  repeated  on  the  final  exam.  In  addition,  some 
of  the  repeated  items  were  identically  so,  while  other  ‘repeated’  items  tested  the  same  concepts  but  with 
different  wording.  We  found  evidence  that  frequent  testing  and  repetition  of  tested  items  can  improve 
course  performance  up  to  about  10%,  though  the  results  varied  across  the  studies  so  further  work  is  needed 
to  clarify  why.  We  also  observed  that  under  some  circumstances  students  did  as  well  or  even  better  on 
re-worded  test  items  as  they  did  when  the  item  was  repeated  in  exactly  the  same  words. 


Keywords:  testing  effect,  frequency  of  testing,  transfer  of  training,  college  learning 


Among  the  oft-studied  variables  that  affect  learning  and  reten¬ 
tion  are:  the  frequency  of  tests;  whether,  how  often,  and  in  what 
form  the  material  has  previously  been  tested,  that  is,  the  testing 
effect;  and  manipulations  that  affect  the  extent  to  which  learning 
transfers  to  new  test  environments.  The  present  work  assesses 
these  and  related  effects  across  four  entire  semesters  in  a  college 
course.  That  is,  it  presents  a  study  and  three  variations  (near 
replications)  of  it  using  highly  ecologically  valid  materials  and 
settings  in  both  high-stakes  and  (in  some  instances)  low-  or  no¬ 
stakes  testing.  One  major  motivation  for  these  studies  is  to  further 
determine  whether  the  laboratory-based  findings  associated  with 
these variables generalize to the college classroom over an entire
course. Another is to examine some boundary conditions on their
effectiveness  and  even  on  their  proposed  mechanisms. 

The  Testing  Effect 

The  testing  effect  has  a  substantial  history.  More  than  a  century 
ago  Myers  (1914)  reported  a  study  involving  seventh  and  eighth 
graders  who  were  given  a  surprise  recall  test  after  getting  a  list  of 
10 words to spell. Some students who got a delayed-recall test, for
example  an  hour  after  spelling  the  words,  had  also  been  given  an 
immediate  recall  test  without  feedback.  On  the  delayed  test  these 
students recalled more items than those who had not been given
the immediate recall test. Taking the first test helped performance
on  the  later  one:  hence,  “the  testing  effect.” 

Myers  also  tells  of  a  student  (it  was  a  different  time:  a  footnote 
informs  us  it  was  Miss  Margaret  Griffith)  who  had  recited  two 
prose  passages  before  the  College  Literary  Society,  one  464  words 
long  and  the  other  1,242  words.  He  asked  her  to  recite  the  longer 
one to him once each week for 7 weeks, without feedback and with
no  further  study  or  practice.  At  the  end  of  that  time  she  recited  it 
perfectly.  He  then  asked  her  to  recall  the  shorter  passage  and 
reported  that  she  could  recall  less  than  half  of  it — a  case  study  of 
the  testing  effect.  Myers  (1914)  put  his  conclusion  succinctly: 
“Simple  recall  of  stimuli  wholly  or  partly  learned  aids  in  their 
retention”  (p.  128). 

Myers  was  clearly  onto  something.  While  its  prominence  has 
sometimes  waned,  as  evidenced  by  the  title  of  Glover’s  (1989) 
paper: “The ‘Testing’ Phenomenon: Not Gone But Nearly Forgotten,”
the evidence for it is substantial and has continued to grow
(e.g.,  Dunlosky,  Rawson,  Marsh,  Nathan,  &  Willingham,  2013; 
Kang,  McDaniel,  &  Pashler,  2011;  Rawson  &  Dunlosky,  2013; 
Roediger & Karpicke, 2006a; Rowland, 2014). Work on the
testing  effect  has  been  further  extended  to  the  classroom,  including 
studies  using  grade  school  and  high  school  students,  and  testing 
materials  drawn  from  lessons  in  science  (e.g.,  McDaniel,  Agarwal, 
Huelser,  McDermott,  &  Roediger,  2011;  McDaniel,  Thomas, 
Agarwal,  McDermott,  &  Roediger,  2013),  history  (e.g.,  Carpenter, 
Pashler, & Cepeda, 2009; McDermott, Agarwal, D’Antonio, Roediger,
& McDaniel, 2014), social studies (e.g., Roediger, Agarwal,
McDaniel,  &  McDermott,  2011),  and  others.  There  also  has  been 
some  research  with  college  (and  even  medical)  students  testing  the 
testing  effect  (e.g.,  Cranney,  Ahn,  McKinnon,  Morris,  &  Watts, 
2009;  Kromann,  Jensen,  &  Ringsted,  2009;  McDaniel,  Roediger, 
&  McDermott,  2007).  See  also  Roediger  and  Karpicke  (2006b). 

In  their  comprehensive  summary,  Dunlosky  et  al.  (2013)  say, 
“.  .  .  we  rate  practice  testing  as  having  high  utility”  (p.  35).  And  in 
the  Pashler  et  al.  (2007)  report  the  recommendation  to  apply  the 
testing  effect  within  the  nation’s  classrooms  (recommendation  5b) 
is  one  of  only  two  said  to  have  a  strong  level  of  evidence  in  its 
favor  (italics  in  original): 

5.  Use  quizzing  to  promote  learning.  Use  quizzing  with  active  re¬ 
trieval  of  information  at  all  phases  of  the  learning  process  to  exploit 
the  ability  of  retrieval  directly  to  facilitate  long-lasting  memory 
traces.  .  .  5b.  Use  quizzes  to  reexpose  students  to  key  content,  (p.  2) 

The  Frequency  of  Testing 

The  frequency  of  testing  also  has  a  substantial  history.  More 
than  80  years  ago,  Keys  (1934)  carried  out  an  early  study  inves¬ 
tigating  the  frequency  of  testing  using  the  complex  materials  from 
an  actual  college  course.  He  examined  the  effects  of  weekly  versus 
monthly  tests  on  retention  in  a  course  on  educational  psychology. 
To  do  so,  he  taught  two  sections  of  the  same  course,  in  the  same 
lecture  hall  at  the  University  of  California,  “  .  .  .  and  great  pains 
were  taken  to  keep  the  instruction  identical”  across  them  (Keys, 
1934,  p.  429).  To  quickly  crush  any  budding  sense  of  nostalgia  for 
the  cozy  classrooms  of  the  past,  we  learn  that  enrollment  in  the  two 
sections  totaled  three  hundred  sixty  students.  Keys  used  the  same 
test  items  in  each  section,  one  group  getting  more  frequent  and 
shorter exams than the other. The in-course exams comprised
true-false and completion items in the ratio of 7 to 1, while the
final  exam  was  composed  of  100%  true-false  items. 

Aside  from  posting  grades,  no  feedback  was  provided  to  stu¬ 
dents  on  their  in-course  exam  performance,  although  persistent 
students  could  see  their  corrected  exams.  They  had  to  be  persistent 
because,  “the  time  and  place  were  intentionally  made  so  inconve¬ 
nient”  (Keys,  1934,  p.  430)  that,  on  average,  only  about  10%  of  the 
students succeeded in seeing an exam. (Again, it was a different
time.) 

Students  in  the  weekly  exam  section  did  significantly  better  (by 
12%)  on  the  in-course  exams  themselves  than  did  students  in  the 
monthly  exam  section.  However,  this  may  have  been  due  to  the 
facts  that  the  weekly  exams  (a)  covered  less  material,  and  (b)  were 
administered  closer  to  the  time  the  course  material  was  presented. 
More  importantly,  Keys  actually  gave  two  “final”  exams  in  par¬ 
allel  forms:  one  surprise  exam  on  the  penultimate  class  day,  and 
two  weeks  later  the  announced  final.  On  the  unannounced  final  the 
weekly  exam  group  significantly  outperformed  the  monthly  exam 
group — they  differed  by  7%.  That  difference  is  unlikely  due  to  a 
difference  between  the  two  classes  in  the  time  between  presenta¬ 
tion  and  test  given  that  Keys  attempted  to  keep  the  instruction 
identical  between  them.  In  contrast,  there  was  no  difference  be¬ 
tween  the  weekly  and  the  monthly  groups  on  the  announced  final, 
which  Keys  suggests  might  have  been  due  to  “cramming.” 

Since  the  early  work  of  Keys  (1934)  there  have  been  hundreds 
of  studies  on  frequency  effects,  most  conducted  in  the  laboratory. 
Others  have  been  carried  out  with  relatively  simple  educational 
materials  (e.g.,  foreign  language  vocabulary),  and  following  Keys, 
a  few  others  have  employed  more  complex  course  content.  With 
some  exceptions  and  reservations  (e.g.,  Donovan  &  Radosevich, 
1999;  Ross  &  Henry,  1939),  most  investigations  have  found  that 
more  tests  lead  to  better  retention  (e.g.,  Bahrick,  Bahrick,  Bahrick, 
&  Bahrick,  1993;  Bangert-Drowns,  Kulik,  &  Kulik,  1991;  Cepeda, 
Pashler,  Vul,  Wixted,  &  Rohrer,  2006;  Cepeda,  Vul,  Rohrer, 
Wixted,  &  Pashler,  2008;  Gaynor  &  Millham,  1976;  Kika, 
McLaughlin,  &  Dixon,  1992;  Leeming,  2002).  Thus,  while  there  is 
no  doubt  the  effect  is  real,  some  investigators  (e.g.,  Carpenter, 
Cepeda,  Rohrer,  Kang,  &  Pashler,  2012;  Delaney,  Verkoeijen,  & 
Spirgel,  2010)  note  that  when  carefully  analyzed  the  apparently 
simple  pattern  of  results  can  actually  be  quite  complicated. 

The  “more  is  better”  generalization  has  been  refined,  for  exam¬ 
ple  by  Bangert-Drowns  et  al.  (1991)  whose  meta-analysis  of  fre¬ 
quent  classroom  testing  showed  that  with  each  successive  test  the 
additional  effect  size  shrinks.  They  also  made  the  useful  observa¬ 
tion  that  effect  size  differences  between  frequent  and  less  frequent 
numbers  of  tests  depend  upon  the  absolute  number  of  tests  in  the 
“less frequent” group, with the difference between zero and
one  in-course  exam  being  quite  substantial.  Indeed,  the  difference 
between  “no  midterm”  and  “one  midterm”  was  so  great  that  the 
present  authors  considered  it  ethically  dubious  to  conduct  a 
classroom-based  experiment  at  the  limit:  That  is,  one  in  which  the 
less  frequent  group  has  no  midterm.  See  also  Cepeda  et  al.  (2008). 

To  date  there  has  been  a  relatively  small  (though  growing) 
subset  of  studies  that  followed  Keys  (1934)  by  carrying  out  the 
investigation  over  an  entire  course  or  substantial  parts  of  one  (see 
below).  A  number  of  questions  remain  about  the  conditions  under 
which  one  will  see  frequency  effects,  what  variables  affect  effect 
sizes,  what  interactions  might  be  practically  important,  whether 
and what types of test materials matter and why, and so forth.

In a heroic review of the (probable) practical efficacy of 10
learning  techniques  derived  from  cognitive  and  educational  psy¬ 
chology  (Dunlosky  et  al.,  2013),  the  authors  give  high  marks  to 
the  likely  positive  influence  of  distributed  practice  (and  by  infer¬ 
ence  to  frequent  testing).  They  conclude  that,  “distributed  practice 
should  work  for  complex  materials  as  well.”  However,  in  what  we 
do not believe is a pro forma statement, they quickly add, “Future
research  should  examine  this  issue”  (Dunlosky  et  al.,  2013,  p.  40). 
Similarly,  in  a  report  sponsored  by  the  National  Center  for  Edu¬ 
cation  Research  (Pashler  et  al.,  2007),  the  first  recommendation, 
one  said  to  have  a  moderate  level  of  evidence  in  support  of  it,  is 
(italics  in  original):  “Space  learning  over  time.  Arrange  to  review 
key  elements  of  course  content  after  a  delay  of  several  weeks  to 
several  months  after  initial  presentation”  (p.  2).  Notable  for  our 
present  purpose,  the  authors  also  go  on  to  say,  “One  limitation  of 
the  literature  is  that  few  studies  have  examined  acquisition  of 
complex  bodies  of  structured  information”  (Pashler  et  al.,  2007,  p. 
6).  The  present  work  is  intended  to  help  limit  the  extent  of  that 
limitation. 

Testing  and  Transfer 

In  the  present  studies  we  addressed  another  question  of  both 
theoretical  and  practical  importance,  namely:  does  an  increase  in 
retrieval  probability  due  to  earlier  testing  require  presenting  the 
identical  question  on  subsequent  test(s),  and  if  not,  what  deter¬ 
mines  whether  the  student  will  recognize  that  the  question  is 
interrogating  the  same  conceptual  knowledge  and  will  thereby 
benefit  from  the  earlier  test?  The  literature  has  mixed  results  on 
this  topic  suggesting  that  there  may  not  be  a  simple  function 
relating  test  items  when  their  form  varies  across  two  or  more 
administrations. 

As  an  example  of  a  problematic  issue,  in  a  laboratory  experi¬ 
ment  using  a  chapter  from  a  biology  textbook,  Wooldridge,  Bugg, 
McDaniel,  and  Liu  (2014)  only  found  an  advantage  due  to  prior 
testing  when  the  final  test  items  were  both  based  on  factual 
material and identical to items tested in the first presentation. They
found  no  advantage  for  items  that  required  applying  the  facts,  even 
when  the  items  were  identical  in  the  two  tests.  Similarly,  they  did 
not  see  a  testing  effect  when  the  item  on  the  second  test  was 
“related”  to  the  first  test  item,  whether  it  was  a  factual  or  an 
application  item. 

And  recently,  Nguyen  and  McDaniel  (2015)  report  on  a  labo¬ 
ratory  experiment  using  published  course  materials  such  as  test- 
bank  questions.  They  note  that  “when  quiz  and  test  items  are 
haphazardly  sampled”  the  type  and  degree  of  relationships  be¬ 
tween  those  items  vary.  In  that  case  they  “found  no  net  gain  on  a 
final  exam  for  students  who  took  the  quizzing  program  compared 
with  those  students  who  were  instructed  to  highlight  while  study¬ 
ing”  (Nguyen  and  McDaniel,  2015,  p.  89;  highlighting  being  a 
common  control  condition  in  laboratory  studies  on  the  testing 
effect).  The  overall  null  effect  was  apparently  due  to  observing  the 
expected  advantage  of  quizzing  when  the  items  tested  the  same 
concept  in  the  same  format,  and  observing  a  reverse  testing  effect 
when  the  quiz  items  asked  about  a  new  example  of  a  previously 
tested  concept.  They  dubbed  this  among  the  “ugly”  findings,  and 
cautioned  (p.  89)  that  if  items  “are  haphazardly  sampled,  teachers 
must  be  cautious  in  assuming  that  testing  will  confer  benefits  for 
exam  performance”  (Nguyen  and  McDaniel,  2015,  p.  89). 


Even after all this time, then, neither the testing effect, the
frequency of testing effect, nor the transfer effect stories are over.
After  stipulating  their  enthusiasm  for  the  testing  and  frequency 
effects,  Dunlosky  and  Rawson  (2012)  go  on  to  say:  “Despite  the 
promise  of  these  techniques,  however,  further  research  is  needed  to 
more  firmly  establish  their  efficacy  in  the  classroom  and  to  dis¬ 
cover  how  they  can  best  be  used  to  ensure  robust  learning  and 
comprehension”  (p.  254). 

Testing  in  the  College  Classroom 

Our  current  work  examines  aspects  of  the  testing  effect,  the 
frequency  of  testing,  and  transfer  effects  in  situ.  We  carried  out  a 
series  of  studies  in  the  tradition  of  Keys  (1934)  and  a  few  others 
(e.g.,  Bangert-Drowns  et  al.,  1991;  Carpenter  et  al.,  2009;  Gaynor 
&  Millham,  1976;  Leeming,  2002;  Pennebaker,  Gosling,  &  Ferrell, 
2013);  that  is,  research  carried  out  in  classrooms  over  entire 
courses (or close to it, e.g., Mawhinney, Bostow, Laws, Blumenfeld,
& Hopkins, 1971; Roediger et al., 2011a), and therefore based
upon  complex  and  interrelated  materials.  Such  an  approach  has  the 
disadvantage  of  complexity  in  terms  of  teasing  out  the  causes  of 
observed  effects.  But  it  also  has  the  advantage  of  being  an  exten¬ 
sion  and  “conceptual”  replication  of  laboratory  studies — as  well  as 
of  other  class-based  work;  and  of  having  a  relatively  short  gener¬ 
alization  path  to  applications  if  the  results  warrant. 

Though  we  have  not  mentioned  it  previously,  one  practical 
motivation  for  this  work  was  to  determine  whether  relatively  small 
changes in course structure can lead to measurable and meaningful
changes in student learning and performance. If so, then perhaps
that can help convince colleagues to embrace those changes.
To  put  it  another  way,  we  are  interested  in  finding  evidence-based, 
inexpensive,  scalable,  and  easily  adoptable  and  adaptable  modifi¬ 
cations  to  teaching  methods  that  positively  impact  student  out¬ 
comes. 

Common  Framework  and  Methods  for  the  Research 

Over  each  of  four  consecutive  semesters  (not  including  summer 
courses)  the  senior  author  taught  two  sections  of  a  college  course 
called  Methods  in  Psychology,  a  required  course  for  both  psychol¬ 
ogy  majors  and  minors  at  the  University  of  Houston  (UH).  Though 
recommended  for  sophomores,  each  section  had  enrollees  from 
that  level  to  those  graduating  at  the  end  of  the  term.  The  course 
covered  an  introduction  to  scientific  methods  in  psychology,  nu¬ 
merous  aspects  of  experimental  design  and  interpretation,  a  basic 
introduction  to  descriptive  statistics  and  null-hypothesis  decision¬ 
making,  and  ethics  in  research.  It  also  introduced  related  topics 
such  as  base  rates  and  psychological  effects  on  decision-making. 

Within  the  constraints  of  the  various  test  schedules  described 
below,  and  following  Keys  (1934),  “great  pains  were  taken  to  keep 
the  instruction  identical”  between  the  lecture  sections  each  term — 
more  honestly,  to  keep  it  highly  similar  given  the  differences  in 
students’  questions  and  something  akin  to  the  “personality”  of 
each  class.  In  each  semester  the  students  in  the  two  lecture  sections 
saw  the  same  PowerPoint  material  (typically  also  made  available 
on  a  web  site  on  the  day  of  the  lecture  ±1),  were  given  the  same 
homework  problems  via  the  web  site,  were  presented  with  the 
same  “ripped  from  the  headlines”  topic  at  the  start  of  almost  every 
class in order to demonstrate and stimulate discussion of the
ubiquity and content of claims about human behavior and how
they  might  be  tested,  heard  the  same  (what  the  instructor  liked  to 
consider)  jokes,  and  got  the  exact  same  exam  items  both  on  the 
in-class  midterm  exams  and  on  the  final  exam.  We  followed  the 
university’s  final  exam  schedule  each  semester,  which  typically 
meant  it  was  administered  about  a  week  to  10  days  after  the  last 
class  meeting. 

Early  in  all  sections  of  the  course  throughout  this  project  the 
instructor  emphasized  the  importance  of  active  learning,  in  gen¬ 
eral,  and  self-testing,  in  particular,  and  posted  that  advice  as  part  of 
an  exam  hint  document. 

For  practical  reasons  we  were  not  able  to  randomly  assign 
students  to  the  two  sections.  In  order  to  help  make  the  student 
characteristics  similar  across  them,  one  section  was  taught  at  10 
a.m.  and  the  other  at  11:30  a.m.  on  Tuesdays  and  Thursdays.  Each 
class  lasted  60  min.  Across  the  two  years  we  varied  whether  the  10 
a.m.  or  the  11:30  a.m.  class  was  in  the  “frequent  exam”  condition. 
As  will  be  further  described  below,  we  also  obtained  standardized 
test  and  GPA  information  about  each  student  so  that  we  could 
regress  out  certain  aptitude/accomplishment  scores  to  yield  more 
comparable  results  between  the  courses  each  semester. 
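
One way to picture what "regressing out" a prior-achievement score does is the following minimal sketch. It assumes a single covariate (a hypothetical prior GPA) and applies a classical ANCOVA-style adjustment: a pooled within-section slope is estimated, and each section mean is shifted to the grand mean of the covariate. The text here does not specify the covariates or model actually used, so the function names and the data below are invented for illustration only.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Pooled within-group slope of score on covariate.

    `groups` maps a section label to a list of (covariate, score) pairs.
    """
    sxy = 0.0
    sxx = 0.0
    for pairs in groups.values():
        x_bar = mean(x for x, _ in pairs)
        y_bar = mean(y for _, y in pairs)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx += sum((x - x_bar) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("covariate has no within-group variation")
    return sxy / sxx


def adjusted_means(groups):
    """Section means adjusted to the grand mean of the covariate."""
    slope = pooled_within_slope(groups)
    grand_x = mean(x for pairs in groups.values() for x, _ in pairs)
    result = {}
    for label, pairs in groups.items():
        x_bar = mean(x for x, _ in pairs)
        y_bar = mean(y for _, y in pairs)
        result[label] = y_bar - slope * (x_bar - grand_x)
    return result


# Invented (prior GPA, final exam %) pairs for two hypothetical sections.
sections = {
    "standard": [(2.0, 60.0), (3.0, 70.0), (4.0, 80.0)],
    "frequent": [(2.5, 70.0), (3.5, 80.0), (4.5, 90.0)],
}
adjusted = adjusted_means(sections)
```

In this invented data the raw gap between sections is 10 points, but the "frequent" section also starts with a higher prior GPA; after adjustment the gap shrinks to 5 points. The same logic is why covariate adjustment matters when sections are not randomly assigned.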

In  addition  to  a  lecture  section,  each  student  was  also  in  a 
smaller  “laboratory”  or  “recitation”  section  that  met  for  an  hour 
each  week.  These  meetings  focused  on  learning  to  use  a  university 
library,  American  Psychological  Association  (APA)  writing  style, 
and  developing  a  research  proposal.  Graduate  student  instructors 
taught  these  sections  following  a  common  syllabus.  Performance 
in  the  “lab  sections”  was  evaluated  separately  and  was  largely 
determined  by  a  paper  proposing  an  experimental  study.  Evalua¬ 
tions  in  those  sections  used  a  common  rubric,  and  we  made  an 
effort  to  have  similar  grading  standards  across  sections  in  any 
given  semester.  While  the  lab  grade  was  factored  into  the  student’s 
final  grade  in  the  course,  it  had  no  impact  on  the  scores  made  by 
students  on  the  in-class  midterms  and  final  exams  that  we  report 
here.  With  an  important  exception,  noted  below,  material  from  the 
lecture  section  was  not  covered  in  the  laboratory  meetings,  and 
therefore  we  will  report  only  on  data  drawn  from  the  lecture 
sections. 

Common  Methods,  Procedures,  and  Design 
of  Materials 

Although  there  were  substantial  conceptual  and  operational  simi¬ 
larities  across  the  semesters,  we  did  make  changes  as  we  proceeded — 
including,  in  the  second  year,  adding  a  potentially  important  vari¬ 
able  that  may  have  impacted  the  results,  and  making  a  significant 
procedural  change  in  Study  4.  Accordingly,  we  will  describe  the 
changes  made  in  each  study  in  the  Method  sections  below. 

In  each  of  the  four  studies  we  compared  student  performance 
across  two  sections,  one  of  which — the  “standard  testing”  class — 
had  two  midterms,  while  the  “frequent  testing”  class  was  given 
four  midterms  in  Study  1  and  eight  midterms  in  the  following  three 
studies.  The  students  were  not  aware  that  we  referred  to  them  as 
the  frequent  and  standard  classes;  each  student  had  access  to  the 
syllabus  for  his  or  her  respective  class  and  was  aware  of  how  many 
tests  would  occur  throughout  the  semester  for  that  class.  By  the 
end  of  the  semester  each  class  was  given  the  same  midterm  test 
items,  and  the  two  sections  each  took  the  same  final  exam.  After 
each midterm (in one of the next two class meetings) the instructor
went over the exam in class showing the keyed correct answer and
responded  to  questions.  In  addition,  students  were  invited  to  visit 
the  teaching  assistant  to  look  over  their  exam(s)  and  the  key.  They 
were  not,  however,  allowed  to  copy  the  exam.  In  these  large 
classes  only  a  small  minority  of  students  availed  themselves  of  this 
opportunity. 

The  simple  hypothesis  is  that  the  frequent  testing  class  will 
outperform the standard testing class on the common final exam
(as well as on the total points earned across all exams). There are
a  number  of  reasons  for  this  simple  prediction.  For  example, 
frequent  testing  likely  leads  to  more  studying  as  well  as  more 
spaced  study  of  a  subset  of  the  materials.  And  that,  in  turn,  likely 
leads  frequently  tested  students  to  make  more  attempts  at  retriev¬ 
ing  relevant  information.  The  present  studies  were  not  primarily 
designed  to  examine  this  last  issue — though  they  do  look  at  a 
related  one. 

We  also  asked  whether  material  tested  on  one  of  the  midterm 
exams  would  lead  to  an  increase  in  retrieval  probability  for  that 
material  on  the  final  exam;  and,  if  such  an  advantage  exists, 
whether  it  is  necessary  that  the  later  test  present  the  item  in 
identical  format.  In  each  study  the  final  exam  repeated  some  items 
from  a  midterm  exam,  thereby  allowing  a  direct  test  of  the  testing 
effect  over  a  semester-long  course.  In  addition,  in  Studies  1,  2,  and 
3  we  also  systematically  varied  the  form  of  some  final  exam 
questions  relative  to  earlier  ones,  as  follows:  on  the  midterm  exams 
we  used  a  mixture  of  multiple-choice  (MC)  and  short-answer  (SA) 
items  (equal  numbers  of  each).  Two  forms  of  each  midterm  were 
developed:  the  items  that  appeared  in  MC  format  on  one  form  were 
in  SA  format  on  the  other.  Approximately  half  of  each  class  was 
consistently  given  each  form.  The  final  exam  also  had  two  forms 
built  on  the  same  principle:  SA  items  on  one  form  occurred  as  MC 
items  on  the  other.  Importantly,  on  the  final  exam  we  exactly 
repeated  a  subset  of  items  from  the  midterms  and  “flipped”  others 
such  that  if  it  appeared  in  MC  format  on  a  midterm  it  was  rewritten 
to  be  an  SA  item  on  the  final,  and  vice  versa.  We  will  dub  it  the 
MC-MC  condition  when  an  MC  item  from  one  of  the  midterms 
also  appeared  in  MC  format  on  the  final  exam,  and  the  MC-SA 
condition  when  an  MC  item  from  one  of  the  midterms  appeared  on 
the  final  exam  in  SA  format.  SA-SA,  and  SA-MC  refer  to  analo¬ 
gous  conditions  when  the  item  appeared  in  SA  format  on  one  of 
the  midterms.  The  flipped  items  allow  a  simple  test  of  whether  the 
retested  item  must  be  identical  in  form  to  receive  an  advantage  in 
a  course  context.  The  final  exam  also  included  items  not  previ¬ 
ously  tested. 

To  be  more  specific  about  the  form  of  the  final  exams  in  Studies 
1-3,  each  contained  64  items  of  which  32  (16  MC,  16  SA)  were 
new  to  the  students.  The  remaining  32  were  constructed  as  follows: 
Of  the  16  MC  items,  eight  were  exact  duplicates  of  items  from  an 
earlier  test:  four  from  the  exam(s)  given  in  the  first  half  of  the 
class,  and  four  from  the  exam(s)  given  in  the  second  half.  Thus, 
these were the MC-MC items. The other eight MC items were
flipped versions of SA items from the earlier exams (thus, SA-MC):
again, four from the exam(s) given in the first half of the
class  and  four  from  the  second.  An  analogous  procedure  was  used 
with  the  remaining  16  SA  items,  eight  being  exact  duplicates  from 
earlier  tests  (SA-SA),  and  eight  being  flipped  versions  of  items 
previously  seen  in  MC  format  (MC-SA). 
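
The composition of the Studies 1-3 final exam described above can be tallied in a few lines. This is only an illustrative sketch: the dictionary name and helper function are our own, and the counts are taken directly from the description in the text.

```python
from collections import Counter

# Item counts per final-exam condition in Studies 1-3, as described in the text.
# Keys read "midterm format-final format"; "new" items were never tested before.
FINAL_EXAM_LAYOUT = {
    "MC-MC": 8,    # exact repeats, MC on both exams
    "SA-MC": 8,    # flipped: SA on a midterm, MC on the final
    "SA-SA": 8,    # exact repeats, SA on both exams
    "MC-SA": 8,    # flipped: MC on a midterm, SA on the final
    "new MC": 16,  # not previously tested
    "new SA": 16,  # not previously tested
}


def final_format(condition: str) -> str:
    """Return the format (MC or SA) in which the item appears on the final."""
    return condition.split("-")[-1].split()[-1]


total_items = sum(FINAL_EXAM_LAYOUT.values())

by_final_format = Counter()
for condition, n in FINAL_EXAM_LAYOUT.items():
    by_final_format[final_format(condition)] += n

repeated_items = sum(
    n for condition, n in FINAL_EXAM_LAYOUT.items()
    if not condition.startswith("new")
)

print(total_items, dict(by_final_format), repeated_items)
```

The tally confirms that the six conditions account for all 64 items, split evenly between MC and SA formats and between repeated and new items.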

In summary, each of the first three studies examined the testing
effect (testing itself aids learning) and the frequency of testing in
a college course. As well, each of these studies examined whether
repeated  items  had  to  be  in  the  same  format  (MC  or  SA)  on  each 
administration  to  experience  the  advantage  of  prior  testing.  We 
expect  the  section  with  more  frequent  exams  to  do  better  than  the 
standard  two-exam  section,  the  repeated  items  (including  the 
flipped  ones)  to  do  better  on  the  final  than  nonrepeated  ones,  and 
to  observe  a  greater  advantage  for  the  exact  items  than  for  the 
flipped  ones.  The  fourth  study  tested  the  same  variables,  but 
manipulated the construct associated with testing the transfer effect
in a different way (see below).

Common aspects of the participants. Each term the participants
were undergraduate students at the sophomore level and above
who  were  enrolled  in  one  of  two  sections  of  a  Methods  in  Psychology 
course  at  the  UH.  Nearly  all  of  them  took  the  course  because  it  is 
required  for  psychology  majors  and  minors.  At  this  time  UH  has 
among  the  most  diverse  college  student  populations  in  the  country 
(e.g.,  there  is  no  ethnic  majority  on  campus),  and  psychology  courses 
reflect  that  diversity.  A  preponderance  of  UH  psychology  majors  are 
females  (as  is  typical  across  the  United  States  at  this  time),  and  the 
same  is  true  for  participants  in  these  studies. 

In  addition  to  collecting  their  exam  scores,  we  obtained  for  each 
student  certain  standard  data  collected  by  the  university.  In  the  ideal 
case  those  would  be  the  same  data  for  each  participant,  but  at  UH 
some  students  present  with  SAT  scores,  fewer  with  ACT  scores,  and 
some  have  neither  because  in  this  state  a  student  can  transfer  to  a 
senior  level  institution  if  he  or  she  meets  certain  criteria,  for  example, 
has  successfully  completed  a  set  of  core  courses  at  a  community 
college  or  other  public  4-year  school.  The  UH  gets  a  lot  of  transfer 
students  and  does  not  collect  SAT  or  ACT  scores  from  most  of  them. 
We  also  obtained  cumulative  GPA  scores  for  each  student.  These 
GPAs were based on varying numbers of courses: those an individual
student had completed at UH up to and including the semester in
which  he  or  she  took  the  class. 

Because we could not randomly assign students to class sections,
we constructed an “aptitude” score for each participant based
upon  the  standardized  test  and  GPA  information  we  obtained.  For 
each  of  the  following  on  which  an  individual  had  a  score,  SAT 
Math  and  Critical  Thinking,  ACT,  and  UH  GPA,  we  computed  the 
student’s  standard  score  relative  to  all  students  in  all  eight  sections 
across  the  two  years.  If  a  student  had  a  score  on  both  SAT 
measures,  we  averaged  those  standard  scores  to  get  one  for  SAT. 
Then  we  averaged  the  standard  scores  to  obtain  a  single  value  for 
each  participant’s  aptitude. 
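
The aptitude composite described above can be sketched as follows. This is a minimal illustration rather than the authors' actual procedure: the function names are hypothetical, the use of the population standard deviation is our assumption, and so is the handling of a student who has only one of the two SAT measures (we simply use the one available).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize the non-missing values in a list; None stays None.

    Assumption: the population SD is used. The text does not say which
    SD formula was applied.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return [None] * len(values)
    mu, sigma = mean(present), pstdev(present)
    if sigma == 0:
        return [None] * len(values)
    return [None if v is None else (v - mu) / sigma for v in values]


def aptitude_scores(sat_math, sat_critical, act, gpa):
    """Average each student's available standard scores into one value."""
    z_math, z_crit, z_act, z_gpa = map(
        z_scores, (sat_math, sat_critical, act, gpa)
    )
    scores = []
    for zm, zc, za, zg in zip(z_math, z_crit, z_act, z_gpa):
        # The two SAT standard scores are averaged into a single SAT value.
        sat_parts = [z for z in (zm, zc) if z is not None]
        sat = mean(sat_parts) if sat_parts else None
        available = [z for z in (sat, za, zg) if z is not None]
        scores.append(mean(available) if available else None)
    return scores


# Hypothetical records for four students (None = score not on file).
example = aptitude_scores(
    sat_math=[600, 500, None, 700],
    sat_critical=[550, 450, None, None],
    act=[None, None, 24, 30],
    gpa=[3.0, 2.0, 3.5, 3.5],
)
print(example)
```

Because each student's composite averages only the measures on file, transfer students without SAT or ACT scores still receive an aptitude value based on GPA alone.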

Over  the  course  of  this  work  we  made  three  possibly  significant 
changes  in  methods,  and  a  few  that  we  consider  minor  ones.  These 
will  be  described  below. 

Common  analyses.  In  each  of  the  following  four  studies,  we 
conducted  two  main  analyses.  First,  to  investigate  the  effects  of 
prior  testing  on  final  exam  performance,  a  repeated  measures 
analysis  of  covariance  (ANCOVA;  SAS  9.4)  was  performed  using 
the  mean  score  on  each  of  the  six  types  of  final  exam  items  for 
each  participant  as  the  dependent  variable  (for  clarification,  these 
six  types  of  items  were  MC-MC,  SA-MC,  SA-SA,  MC-SA,  new 
MC,  and  new  SA  in  Studies  1-3;  a  similar  design  using  six  types 
of items was used in Study 4). Although the factors included in
the  models  vary  across  the  four  studies,  each  analysis  included  the 
following  factors:  student  aptitude,  testing  frequency,  and  item 
type.  This  analysis  provides  tests  of  the  effect  of  repetition  (the 
testing  effect),  transfer  of  learning  (performance  on  flipped  and 


new  items  in  Studies  1-3,  and  related  and  new  items  in  Study  4), 
and  potential  interactions  between  testing  frequency  and  item  type. 
We  report  results  from  fixed  effects  and  orthogonal  contrasts  in 
this  analysis  to  evaluate  our  specific  hypotheses  in  each  study.  In 
Studies  1-3  orthogonal  contrasts  are  used  to  test  (a)  whether 
students  perform  better  on  repeated  (MC-MC,  SA-SA,  SA-MC, 
and  MC-SA)  than  “new”  (new  MC  and  new  SA)  items,  and  (b) 
whether  students  perform  better  on  “exact”  (MC-MC  and  SA-SA) 
than  flipped  (SA-MC  and  MC-SA)  items. 
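
The two planned comparisons for Studies 1-3 can be written as contrast weight vectors over the six item types. The weights below are our own illustration (the paper does not list the coefficients entered into SAS), and the cell means are hypothetical; the sketch only checks that such weights sum to zero and are mutually orthogonal.

```python
ITEM_TYPES = ["MC-MC", "SA-MC", "SA-SA", "MC-SA", "new MC", "new SA"]

# (a) repeated (exact + flipped) versus new items
REPEATED_VS_NEW = [1, 1, 1, 1, -2, -2]
# (b) exact (MC-MC, SA-SA) versus flipped (SA-MC, MC-SA) items
EXACT_VS_FLIPPED = [1, -1, 1, -1, 0, 0]


def dot(a, b):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def contrast_estimate(weights, means):
    """Mean of the positively weighted cells minus mean of the negative ones."""
    positive_total = sum(w for w in weights if w > 0)
    return dot(weights, means) / positive_total


# Hypothetical cell means (proportion correct), in ITEM_TYPES order.
cell_means = [0.80, 0.80, 0.80, 0.80, 0.60, 0.60]

print(sum(REPEATED_VS_NEW), sum(EXACT_VS_FLIPPED))
print(dot(REPEATED_VS_NEW, EXACT_VS_FLIPPED))
print(contrast_estimate(REPEATED_VS_NEW, cell_means))
```

Orthogonality (a zero dot product with equal cell sizes) is what allows the two questions, repeated versus new and exact versus flipped, to be tested independently of one another.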

Second,  to  perform  a  more  direct  test  of  the  effect  of  testing 
frequency,  an  ANCOVA  (SAS  9.4)  on  the  final  exam  score  was 
used.  Factors  included  in  these  models  also  varied  across  the 
studies; however, in each study the model included student aptitude
and testing frequency.

Study  1 

To  recap,  Study  1  was  designed  to  test  whether  students  in  a  college 
course  perform  better  on  final  exam  items  taken  during  a  previous 
exam  and  whether  taking  more  frequent  tests  benefits  subsequent  final 
exam  performance.  Students  in  two  sections  of  a  college  Methods 
course  either  took  two  or  four  course  exams  during  a  15-week 
semester,  and  then  took  the  same  comprehensive  final  exam.  Items  on 
the  final  exam  either  (a)  appeared  exactly  as  taken  during  one  of  the 
course  exams,  or  (b)  asked  the  same  question  as  a  previous  item  but 
in  a  different  form  (MC  or  SA),  or  (c)  were  not  previously  tested  in 
any  form.  We  predicted  that  students  would  perform  better  on  items 
that  they  had  previously  seen  (in  any  form)  than  on  new  items,  and 
better  on  exact  items  than  flipped  ones.  We  also  predicted  that 
students  in  the  more  frequently  tested  class  would  perform  better  on 
the  final  exam  overall. 

Method 

Participants.  There  were  75  students  in  the  standard  class  and 
84  in  the  frequently  tested  one. 

Materials  and  procedure.  Two  sections  of  the  Methods  in 
Psychology  three-credit  course  were  taught.  The  standard  testing 
section  (two  midterms  and  two  pop  quizzes)  was  offered  at  10  a.m. 
on  Tuesdays  and  Thursdays,  while  the  frequent  testing  section 
(four midterms, four pop quizzes) took place at 11:30 a.m. on those
same  days.  There  were  29  class  meetings  across  15  weeks. 

The  two  midterms  in  the  standard  class  took  place  in  Weeks  6 
and  12.  Each  exam  contained  20  items:  half  MC  and  half  SA.  The 
four  midterms  in  the  frequent  class  took  place  in  Weeks  3,  5,  9,  and 
12.  Each  of  those  exams  had  10  items,  half  MC  and  half  SA.  In 
total,  each  class  was  given  the  same  40  midterm  items.  Two  forms 
of  the  test  were  used  in  each  class  as  described  earlier. 

The  pop  quizzes  contained  three  to  five  MC  items  and  were 
unannounced. They were included to stimulate and reward attendance.
Questions were presented via PowerPoint slides, with students
marking and then turning in answer sheets. After that, the instructor
presented the correct answer for each item. Pop quizzes in the frequent
class occurred in Weeks 2, 5, 7, and 14, while in the standard
class  they  were  given  in  Weeks  5  and  14.  The  final  exam  was 
constructed  as  described  above  in  the  Common  Methods  section. 

In Study 1 we encountered missing data on item-by-item performance
on the final exam for 39 participants. However, we
retained the total final exam score for each of these participants.
For this reason, the item type analysis uses 39 fewer participants
than the final exam score analysis.

Results 

Results from a repeated measures ANCOVA revealed significant
main effects of student aptitude, F(1, 117) = 16.84, p <
.0001, and item type, F(5, 117) = 45.51, p < .0001. The main
effect of testing frequency, F(1, 117) = 0.63, p = .43, and the
interaction of testing frequency and item type, F(5, 117) = 1.72,
p = .14, were not significant predictors of final exam performance.
A specific contrast revealed that students performed significantly
better on repeated, that is, both exact and flipped items (least
squares M = .72; a model adjusted mean, henceforth “LS M”) than
on new items (LS M = .63), F(1, 117) = 57.59, p < .0001. This
exact versus new item contrast yields a Cohen’s d effect size of
.60.¹ No significant difference was found between exact (LS M =
.71) and flipped items (LS M = .70; d = .15). Figure 1 shows
performance on the final exam for the standard and frequent
classes by item type.
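
The effect sizes reported in these Results sections are Cohen's d values, which the paper's footnote defines as the difference in LS means divided by the pooled within-group standard deviation. A minimal sketch of that calculation follows; the pooling formula is the conventional one weighted by degrees of freedom, which we assume here because the paper does not spell it out, and the input values are purely illustrative.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled within-group SD, weighting each variance by its df (assumed)."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, mean2, sd1, n1, sd2, n2):
    """Difference in (LS) means divided by the pooled within-group SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


# Illustrative values only; these are not taken from the studies.
d = cohens_d(mean1=0.70, mean2=0.60, sd1=0.20, n1=10, sd2=0.20, n2=10)
print(round(d, 2))
```

A positive d means the first group scored higher; the paper reports a negative d when the observed difference runs opposite to the hypothesis.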

An  additional  analysis  to  test  the  effect  of  testing  frequency  on 
final  exam  scores  (excluding  item  characteristics)  again  found  a 
significant effect of student aptitude, F(1, 155) = 68.12, p <
.0001. The effect of testing frequency was not significant, F(1,
155) = .11, p = .74; comparable performance was observed
between the frequent (LS M = .66) and standard (LS M = .65)
classes (d = .07). The interaction between testing frequency and
student aptitude, F(1, 155) = .37, p = .55, was not significant.

Discussion 

Students performed significantly better on items that had appeared
on a previous exam. Interestingly, this repetition effect did
not  depend  on  whether  the  repeated  items  were  in  the  same  form 
as  previously  seen,  as  evidenced  by  the  finding  that  students  did 
not  perform  significantly  better  on  exact  items  than  on  flipped 
items.  This  finding  differs  from  that  of  Nguyen  and  McDaniel 
(2015),  though  it  replicates  other  results  in  the  literature  (Bjork, 
Little,  &  Storm,  2014)  and  suggests  that  testing  during  college 
exams  may  facilitate  a  near  form  of  transfer  of  learning. 

The effect of testing frequency was not significant in Study 1,
contrary to our hypotheses. Over an entire semester, the increase
from two to four exams may not be enough to make a difference
in exam performance. So one reason for Study 2 was to increase
the number of tests in the frequent condition. We also modestly
increased the number of midterm exam items in order to sample
topics more broadly and to give some additional power to the
comparison of item types (e.g., identical vs. flipped).

Figure 1. Least squares means (proportion correct) on the final exam in
Study 1 as a function of testing frequency and item type. Error bars
represent standard errors of the means.

Study  2 

Study 2 was designed to see whether increasing the number of
in-course exams would replicate the direction of findings from Study
1 and to further examine the effect sizes. In this semester (and in
Studies  3  and  4)  the  frequent  testing  group  received  eight  short 
in-course  exams  while  the  standard  testing  group  again  got  two. 
Increasing  the  number  of  announced  exams  almost  certainly  increases 
the  number  of  times  the  typical  student  reviews  the  material,  and 
likely  increases  the  number  of  times  that  students  make  relevant 
retrieval  efforts  while  preparing  for  and  taking  those  exams.  And,  of 
course,  the  shorter,  more  frequent  exams  generally  mean  that  the 
amount  of  material  to  be  reviewed  is  less.  However,  in  this  study  the 
students  were  informed  that  later  exams  could  ask  about  material  from 
earlier  in  the  course — and  they  were  reminded  that  the  final  exam  was 
comprehensive.  In  fact,  a  small  number  of  items  on  the  later  in-course 
exams  did  ask  about  material  covered  earlier. 

Method 

Participants.  Class  sizes  were  considerably  smaller  in  Study 
2:  there  were  34  students  in  the  standard  exam  class  and  36  in  the 
frequently  tested  one. 

Materials  and  procedure.  In  Study  2  the  frequent  testing 
section  took  place  at  10:00  a.m.  on  Tuesdays  and  Thursdays,  while 
the  standard  testing  section  was  offered  at  11:30  a.m.  those  same 
days.  There  were  28  class  meetings  across  14  weeks,  with  a 
1-week  break  after  the  eighth  week. 

The  two  midterms  in  the  standard  class  occurred  in  Weeks  6  and 
12.  Each  of  these  exams  contained  24  items,  half  MC  and  half  SA. 
The  eight  in-course  exams  in  the  frequent  class  took  place  in 
Weeks  2,  3,  4,  5,  7,  8,  9,  and  11.  Each  of  those  exams  had  six 
items,  half  MC  and  half  SA.  Overall,  both  classes  were  given  the 
same  48  midterm  items.  The  final  exam  was  created  as  described 
above.  No  pop  quizzes  were  given  in  this  semester. 

We  report  results  for  students  who  finished  the  course  and  who 
took  both  midterms  if  in  the  standard  class,  and  who  took  at  least 
seven  of  the  eight  exams  if  in  the  frequent  class.  For  the  three 
students  who  missed  one  of  the  eight,  we  assigned  their  average 
score  on  the  remaining  seven  to  the  missing  score  for  the  purpose 
of calculating a fair cumulative score. In Study 2 we used analytic
strategies analogous to those used in Study 1.
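
The handling of a missed exam described above (substituting the student's own average on the exams taken) amounts to simple person-mean imputation. The sketch below is illustrative only; the function name and the example scores are hypothetical.

```python
from statistics import mean


def fill_missing_with_own_mean(scores):
    """Replace each missing exam score (None) with the student's own mean
    on the exams actually taken."""
    observed = [s for s in scores if s is not None]
    if not observed:
        raise ValueError("at least one observed score is required")
    own_mean = mean(observed)
    return [own_mean if s is None else s for s in scores]


# Hypothetical student who took seven of the eight exams (scores out of 6).
raw = [4, 5, None, 6, 3, 5, 4, 5]
filled = fill_missing_with_own_mean(raw)
print(filled)
```

One consequence of this choice is that the imputed score leaves the student's average unchanged, so the missed exam neither helps nor hurts the cumulative total.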

Results

The LS means on proportion correct are shown in Figure 2. Study
2 provided similar results to those in Study 1. An analysis of final
exam performance on the various types of items showed significant
effects of student aptitude, F(1, 67) = 30.43, p < .0001, and item
type, F(5, 67) = 32.26, p < .0001. The effect of testing frequency,
F(1, 67) = 1.87, p = .15, was not statistically significant, though it
appeared to be stronger in Study 2 than Study 1 despite fewer
participants. Interestingly, we observed a significant interaction between
testing frequency and item type, F(5, 67) = 3.92, p = .004 (see
Figure 2). As was also shown in Study 1, students performed better on
repeated items (LS M = .71) than on new ones (LS M = .68; d =
.20), F(1, 67) = 7.83, p = .0007, while the performance difference on
exact (LS M = .71) versus flipped items (LS M = .71; d = -.05)²
was not significant, F(1, 67) = .25, p = .62.

Figure 2. Least squares means (proportion correct) on the final exam in
Study 2 as a function of testing frequency and item type. Error bars
represent standard errors of the means.

1 Cohen’s d for the final exam outcome was calculated by dividing the
difference in LS means between the two groups of interest by the pooled
within-group standard deviation.

An  ANCOVA  on  the  final  exam  score  showed  a  significant  effect 
of  student  aptitude,  F(l,  66)  =  44.82,  p  <  .0001,  but  no  significant 
effect  of  testing  frequency  (frequent  LS  M  =  .71;  standard  LS  M  = 
.68;  d  =  .20),  F(l,  66)  =  2.12,  p  =  .15,  or  the  interaction  between 
testing  frequency  and  student  aptitude,  F(l,  66)  =  .56,  p  =  .46. 

Discussion 

In  Study  2  we  again  found  evidence  for  the  testing  effect:  students 
performed  better  on  repeated  items  than  on  new  ones.  And,  as  also 
found  in  Study  1,  the  repetition  effect  did  not  differ  between  items  that 
were  identical  to  those  from  a  previous  exam  and  those  that  were 
flipped.  While  there  was  not  an  overall  effect  for  test  frequency,  we 
note that frequently tested students were correct on a higher proportion
of items in five of the six item types, including the new items that
they  had  not  previously  seen.  The  only  major  change  in  design  from 
Study  1  to  Study  2  was  the  increase  in  number  of  exams  for  the 
frequent  class  from  four  exams  to  eight.  As  noted,  although  the  effect 
of  testing  frequency  was  not  significant  in  Study  2,  it  trended  stronger 
than  in  Study  1  even  though  the  sample  size  (and  thus  the  power  of  the 
comparison)  was  substantially  smaller. 

Study  3 


Method 

In  Study  3  the  basic  testing  frequency  manipulation  was  again 
two  versus  eight  in-class  exams.  However,  in  this  semester  we 
added  a  new  component  to  the  manipulation  in  order  to  examine 


and  compare  the  effects  of  low-stakes  as  well  as  high-stakes 
testing.  Roediger  et  al.  (2011b)  report  that  sixth  grade  students 
improved  their  performance  on  graded  free-recall  exams  after 
receiving  low-stakes  MC  quizzes  two  days  earlier.  On  an  end-of- 
semester  MC  test  the  students  also  performed  better  on  previously 
quizzed  items  than  on  nonquizzed  ones.  In  their  study  low  stakes 
was  equivalent  to  no  stakes  because  no  grade  depended  upon  the 
quiz  performance. 

Study  3  builds  on  this  work  by  presenting  a  subset  of  students 
with no-stakes quizzes over a semester. We should quickly note
that Study 3 neither uses their manipulation nor directly addresses
the testing effect, because none of the no-stakes quiz items appeared
later on the final exam. In this study the quiz dates were not
announced  in  advance  so  students  in  the  quiz  sections  could  not 
specifically  prepare  for  them.  However,  we  presume  that  they  did 
make  additional  efforts  to  recall  aspects  of  the  course  material 
during  the  quizzes  themselves.  Is  that  enough  to  produce  better 
performance  on  the  final?  One  reason  it  might  be  is  that,  unlike 
studies  involving  lists,  a  course  involves  interrelated  material  such 
that  both  reading  about  and  making  retrieval  attempts  for  Concept 
A  may  also  help  the  later  retrieval  of  a  (related)  Concept  B. 

Participants. Seventy students were in the standard class, of
whom 44 received the low-stakes quizzes and 26 did not. Seventy-five
students were in the frequent class; of those, 47 were administered
the quizzes and 28 were not.

Materials and procedure. In the current study we took advantage
of the eight separate laboratory sections to present a subset
of  students  from  each  lecture  section  with  a  series  of  six  short 
quizzes  spaced  across  the  semester.  Two  of  the  five  lab  sections  in 
the  standard  class,  and  two  of  the  three  in  the  frequent  class,  were 
arbitrarily  chosen  to  administer  the  quizzes,  which  were  given  in  a 
“pop  quiz”  fashion — that  is,  unannounced.  Each  of  the  six  quizzes 
contained  six  items,  three  MC  and  three  SA.  They  were  presented 
in  PowerPoint  format,  and  the  instructor  gave  feedback  on  the 
correct  answers  immediately  after  collecting  the  answer  sheets. 
While  the  quiz  items  asked  about  lecture  material,  they  did  not 
duplicate  any  actual  exam  items.  Students  in  the  quiz  sections  were 
informed  that  the  quizzes,  while  collected,  were  not  going  to  be 
graded — that  they  were  just  for  practice  and  for  their  benefit.  They 
were,  however,  told  that  they  would  receive  extra  credit  for  being 
present  and  taking  the  quizzes. 

The  quizzes  were  administered  in  lab  sections  prior  to  Exams  1, 
2,  3,  5,  7,  and  8  for  the  frequent  exam  group;  and  three  of  the  six 
quizzes  were  given  prior  to  Exam  1  for  the  standard  group. 

To summarize this change, the addition of the quiz variable
meant that we had four between-subjects groups in Study 3 (and
similarly in Study 4) determined by crossing the frequency of
midterm variable with the quiz versus no-quiz variable. The midterms
were high-stakes exams (performance on them affected the
students’ grades), while the quizzes were low- (or no-) stakes
because quiz performance had no effect on grades. This 2 × 2
design led to cells with 2 (high stakes), 8 (high stakes), 8 (2 high
stakes, 6 no stakes), and 14 (8 high stakes, 6 no stakes) entries. The
quiz items themselves were drawn from the course material, but
none were given on the exams in the lecture section. Thus, the
quizzes added some possibly relevant additional tests, but did not
contribute to our study of the testing (repetition) effect. Student
quiz participation statistics (for students in quiz sections) in Studies
3 and 4 are shown in Table 1. Though students took a higher
percentage of the quizzes in Study 3 as compared with Study 4, in
both studies a large percentage of students took three or
more quizzes (approximately 90% in Study 3 and 60% in Study 4).

2 We report a negative effect size when the observed difference was
opposite to the hypothesized effect.
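
The four between-subjects cells of the 2 × 2 design in Study 3, and the number of tests each group received, can be enumerated as below. The variable names are our own; the counts (two or eight high-stakes exams, zero or six no-stakes quizzes) come from the description in the text.

```python
from itertools import product

# High-stakes in-course exams per lecture section.
MIDTERMS = {"standard": 2, "frequent": 8}
# No-stakes quizzes per lab-section condition.
QUIZZES = {"no quiz": 0, "quiz": 6}

cells = {
    (frequency, quiz): {
        "high_stakes": MIDTERMS[frequency],
        "no_stakes": QUIZZES[quiz],
        "total": MIDTERMS[frequency] + QUIZZES[quiz],
    }
    for frequency, quiz in product(MIDTERMS, QUIZZES)
}

for cell, counts in cells.items():
    print(cell, counts)
```

Note that two different cells (standard class with quizzes, frequent class without) both total eight tests, but they differ in how many of those tests counted toward the grade.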

Study  3  also  modified  the  composition  of  the  exam  items  in 
order  to  allow  cleaner  and  more  interpretable  comparisons  of  the 
testing  effect  across  item  types.  Consider  the  comparison  on  the 
final  exam  between  MC-MC  items  and  SA-MC  items — that  is, 
when  we  compare  performance  of  MC  items  on  the  final  exam 
when  preceded  either  by  (a)  the  identical  MC  item  on  a  midterm, 
or  (b)  by  the  same  item  presented  in  SA  form  on  the  midterm,  that 
is, the flipped items. Call them MC_I and MC_F, respectively. In
order to tighten this comparison, we examined the performance on
final exam items in the previous semesters and constructed the
final exam in Study 3 such that prior performance on MC_I and
MC_F were equated (i.e., within ±1% in each pair). Any performance
difference due to inherent item difficulty is thereby controlled.
We carried out an analogous process for SA-SA and
MC-SA items such that the SA_I and SA_F items were equated
within ±4% in each pair.

In  this  study  the  standard  testing  section  was  offered  at  10:00 
a.m.  on  Tuesdays  and  Thursdays,  while  the  frequent  testing  section 
took  place  at  11:30  a.m.  on  those  same  days.  There  were  29  class 
meetings  across  15  weeks. 

The  two  midterms  in  the  standard  class  took  place  in  Weeks  6 
and  12.  Each  contained  24  items,  half  MC  and  half  SA.  The  eight 
in-course  exams  in  the  frequent  class  took  place  in  Weeks  2,  4,  5, 
7,  8,  9,  11,  and  13.  Each  had  six  items:  half  MC  and  half  SA. 
Overall,  both  the  standard  and  the  frequent  classes  were  given  the 
same  48  midterm  items.  Final  exams  were  constructed  as  described 
earlier. Analyses performed in Study 3 were comparable to analyses
performed in previous studies, while including “quiz status”
as  a  term  in  our  model  to  test  the  effect  of  taking  no-stakes  quizzes 
and  its  interaction  with  other  factors. 

Results 

The pattern of results on the final exam for Study 3 is shown in
Figure 3. The analysis of item performance revealed significant
effects of student aptitude, F(1, 140) = 65.32, p < .0001, item
type, F(5, 140) = 41.37, p < .0001, and a significant interaction
between testing frequency and item type, F(5, 140) = 3.17, p <
.01. Testing frequency, F(1, 140) = 1.72, p = .19, quiz status, F(1,
140) = .13, p = .72, and the interactions between testing frequency
and quiz status, F(1, 140) = 1.19, p = .28, and between
quiz status and item type, F(5, 140) = .12, p = .99, were not
significant. We again found that students performed better on
repeated (LS M = .69) than new items (LS M = .62; d = .39), F(1,
140) = 53.16, p < .0001. Contrary to our hypothesis we found that
students performed better on flipped items (LS M = .72) than
exact items (LS M = .66; d = -.33), F(1, 140) = 22.14, p <
.0001.

Figure 3. Least squares means (proportion correct) on the final exam in
Study 3 as a function of testing frequency and item type. Error bars
represent standard errors of the means.

Table 1
Mean Quizzes Taken and Percentage of Students Taking Each
Quiz in Studies 3 and 4

Number of quizzes taken    Study 3    Study 4
Mean quizzes taken          4.33       3.26
One quiz (%)                4          18
Two quizzes (%)             7          22
Three quizzes (%)           17         15
Four quizzes (%)            24         20
Five quizzes (%)            20         9
Six quizzes (%)             28         16
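
The quiz participation figures in Table 1 can be cross-checked against the statement that roughly 90% (Study 3) and 60% (Study 4) of students took three or more quizzes. The sketch below simply re-adds the tabled percentages; the dictionary and function names are our own.

```python
# Percentage of quiz-section students taking exactly n quizzes (Table 1).
QUIZ_PERCENTAGES = {
    "Study 3": {1: 4, 2: 7, 3: 17, 4: 24, 5: 20, 6: 28},
    "Study 4": {1: 18, 2: 22, 3: 15, 4: 20, 5: 9, 6: 16},
}


def share_taking_at_least(percentages, k):
    """Percentage of students who took k or more quizzes."""
    return sum(p for n, p in percentages.items() if n >= k)


def mean_quizzes(percentages):
    """Mean number of quizzes implied by the percentage distribution."""
    return sum(n * p for n, p in percentages.items()) / sum(percentages.values())


for study, percentages in QUIZ_PERCENTAGES.items():
    print(study, share_taking_at_least(percentages, 3), mean_quizzes(percentages))
```

The means recomputed from the rounded percentages agree closely with the tabled means; any small discrepancy is presumably due to rounding of the percentages.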

The LS means for each combination of quiz status (0, 6) and
exam frequency (S = standard = 2; F = frequent = 8) from Study
3 are shown in Table 2. When the results of the final exam are
analyzed, the overall effect due to the frequency of in-class exams
(frequent LS M = .68; standard LS M = .63; d = .28) was
statistically significant, F(1, 140) = 4.12, p = .04. In contrast, on
the final exam the main effect of quiz status (quiz LS M = .65; no-quiz LS
M = .66; d = -.06), F(1, 140) = .13, p = .72, and the interaction
of testing frequency and quiz, F(1, 140) = .54, p = .46, were not
significant.

Discussion 

Study  3  again  found  evidence  consistent  with  the  testing  effect 
in  a  college  course:  previous  exposure  to  exam  items  improved 
performance  on  the  final  exam.  It  also  found  significantly  superior 
performance  of  the  frequent  versus  the  standard  class  on  the  final 
exam.  An  inspection  of  these  data  hints  that  the  difference  between 
the  frequent  and  standard  exam  schedules  may  be  greater  for  SA 
items than for MC items, suggesting that frequent testing improves
recall memory performance more than recognition memory. When
inspecting Figure 3 it is evident that, although frequent and standard
classes do not differ on final exam MC items (MC-MC,
SA-MC, and new MC), the frequent class consistently outperforms
the  standard  class  on  SA  items  (SA-SA,  MC-SA,  and  new  SA). 
That  led  us  to  make  a  methodological  change  in  Study  4. 

In contrast, there was no hint that taking the low-stakes quizzes
led to an increase in exam performance. We mentioned earlier
some aspects of our design that might have militated against
observing such an effect. These include its no-stakes feature and
the fact that they were unannounced. Perhaps most importantly,
the quiz items were not identical with final exam items, nor did
they as closely tap the same conceptual information as did, for
example, the flipped items on the exams. However, it seems highly
likely that students did make recall efforts on the quizzes, and that
when presented feedback about the correct responses when finished
with a quiz they would have made an attempt to update
information about the course material. Even so, we found no
evidence that students generalized anything they may have learned
from the quizzes to the items on the final exam.

Table 2
LS Means (SD) on Cumulative Score by Testing Frequency and
Quiz Condition for Study 3

Quiz condition    Frequent      Standard      Mean
Quiz              .69 (±.17)    .62 (±.18)    .65
No quiz           .68 (±.18)    .65 (±.13)    .66
Mean              .68           .63

Note. Table shows the least squares (LS) means for cumulative score in
each cell of the Testing Frequency × Quiz interaction in Study 3.

Study  4 

Study  3  found  more  consistent  evidence  for  the  testing  effect 
among  the  MC  items  than  the  SA  items.  That  is,  both  the  MC-MC 
and  SA-MC  conditions  were  superior  to  the  new  MC  condition, 
while  the  SA-SA  condition  was  nearly  identical  to  the  new  SA 
condition.  Also,  we  observed  a  somewhat  larger  frequency  effect 
overall  for  SA  items  than  for  MC  ones.  In  Study  4  we  omitted  MC 
items to examine whether the repetition and (especially) the frequency
effects would be amplified if all test items were presented
in  the  more  difficult  recall  format,  one  that  may  require  more 
robust  retrieval  efforts.  (In  this  study  we  also  added  8  items  at  the 
end  of  the  final  exam  to  explore  another  hypothesis.  We  will  not 
report  those  data  here.) 

Study  4  also  included  no-stakes  SA  quizzes  given  in  some  of  the 
lab  sections.  We  conjectured  that  there  would  be  some  effect  from 
the  no-stakes  quizzing  done  in  the  labs,  again  due  to  the  more 
process-intense  activities  needed  to  respond  to  the  SA  quiz  items. 
Finally,  in  Study  4  the  students  in  the  quiz  sections  were  told  in 
advance  when  the  quizzes  would  occur.  Thus  it  was  possible  for 
them  to  prepare  for  the  quizzes  (even  though  they  were  told  that 
the  quizzes  did  not  count  toward  their  course  grade). 

With respect to the testing effect, Study 4 also systematically
manipulated  the  relationship  between  midterm  items  and  final  exam 
items  in  a  new  way.  Rather  than  changing  items  from  MC  to  SA 
format  and  vice  versa,  in  Study  4  we  varied  the  degree  of  relationship 
between  certain  items  on  a  midterm  and  on  the  final.  Three  types  of 
relationships  were  used:  (a)  some  final  exam  items  were  exact  dupli¬ 
cates  of  midterm  items,  (b)  some  were  new  items,  and  (c)  yet  others 
were  related  to  specific  midterm  items  in  that  they  required  the  same 
conceptual  knowledge  to  answer  but  were  phrased  differently.  Each 
of  these  three  item  types  was  further  divided  into  two  subsets.  One 
subset  of  questions  required  students  to  recall  facts  or  definitions  from 
the  course;  the  other  subset  required  students  to  apply  their  knowledge 
to  a  new  problem  or  setting.  There  were  eight  instances  of  each  of 
these  six  subtypes  on  the  final.  The  Appendix  gives  examples  of 
original  (midterm)  and  related  (final  exam)  items  in  both  the  fact  and 
application  conditions.  To  summarize:  six  item  subtypes  occurred  on 


the  final;  they  resulted  from  a  3  (item  types:  exact,  related,  new)  X  2 
(knowledge  requirement  to  answer:  fact-based,  application  via  exam¬ 
ple)  factorial  design. 
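The crossing of item type and knowledge requirement can be made concrete with a short enumeration. The following is only an illustrative sketch of the design as described in the text; the variable and function names are ours and do not come from the study materials.

```python
from itertools import product

# Factor levels as named in the text; identifiers are illustrative only.
ITEM_TYPES = ("exact", "related", "new")
KNOWLEDGE = ("fact", "application")
ITEMS_PER_CELL = 8  # "eight instances of each of these six subtypes"


def design_cells():
    """Return every (item type, knowledge requirement) combination."""
    return list(product(ITEM_TYPES, KNOWLEDGE))


def analyzed_item_count():
    """Number of final exam items covered by the factorial design."""
    return len(design_cells()) * ITEMS_PER_CELL


cells = design_cells()
total_items = analyzed_item_count()
```

Under these assumptions the enumeration yields six cells and 48 items, which matches the number of SA questions the final exam devotes to the design.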

One  interpretation  of  the  testing  effect  literature  makes  a  clear 
prediction  about  the  exact  items  in  both  the  factual  and  application 
modes:  to  the  extent  that  a  pattern  match  between  the  original  and 
later  items  is  helpful  in  retrieving  the  associated  answer,  we  should 
see  superior  performance  for  exactly  repeated  items  compared  with 
new  items.  There  is  a  question  about  the  related  items,  however,  and 
the  literature  on  this  has  mixed  messages  to  date.  We  will  return  to  this 
issue  in  the  General  Discussion  section.  For  practical  reasons  we 
certainly  hope  to  see  savings  for  the  related  test  items — as  educators 
we  bank  on  transfer  at  least  this  far  from  the  original  learning. 
However,  when  presented  in  another  cloak,  it  is  not  clear  when,  or 
even  that,  students  will  recognize  the  problem  type;  the  changed  cloak 
may  render  the  required  knowledge  invisible. 

Method 

Participants. The participants were 214 undergraduate
students at the sophomore level and above enrolled at UH. There
were 115 students in the standard exam section; of these, 43 were
in  the  lab  quiz  sections  and  72  were  in  the  no-quiz  lab  sections.  A 
total  of  99  students  were  in  the  frequent  exam  lecture  section;  67 
of  them  were  in  the  quiz  sections  and  32  were  in  the  no-quiz  lab 
sections. 

Materials  and  procedure.  In  this  study  the  frequent  testing 
section  was  offered  at  10:00  a.m.  on  Tuesdays  and  Thursdays, 
while the standard testing section took place at 11:30 a.m. on those
same  days.  There  were  28  class  meetings  across  15  weeks.  The 
two  midterms  in  the  standard  class  took  place  in  Weeks  6  and  12. 
Each  contained  24  SA  items.  The  eight  in-course  exams  in  the 
frequent  class  took  place  in  Weeks  2,  4,  5,  6,  8,  10,  12,  and  14. 
Each  had  six  SA  items.  Overall,  both  the  standard  and  the  frequent 
classes  were  given  the  same  48  midterm  items.  Two  forms  of  the 
test  were  used  in  each  exam;  in  this  study  the  two  forms  differed 
only  in  the  order  the  questions  were  asked. 
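A quick arithmetic check shows that the two schedules distribute the same number of midterm items. This is a minimal sketch using only the week numbers and exam lengths stated above; the names are ours.

```python
# Exam weeks and items per exam as stated in the text for Study 4.
STANDARD_WEEKS = (6, 12)
STANDARD_ITEMS_PER_EXAM = 24
FREQUENT_WEEKS = (2, 4, 5, 6, 8, 10, 12, 14)
FREQUENT_ITEMS_PER_EXAM = 6


def total_midterm_items(weeks, items_per_exam):
    """Total in-course exam items for one class schedule."""
    return len(weeks) * items_per_exam


standard_total = total_midterm_items(STANDARD_WEEKS, STANDARD_ITEMS_PER_EXAM)
frequent_total = total_midterm_items(FREQUENT_WEEKS, FREQUENT_ITEMS_PER_EXAM)
```

Both totals come to 48, so the manipulation changes how the items are spread across the semester rather than how many items students answer.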

Just as in Study 3, students in the quiz sections were informed
that the quizzes would be collected but that they would not receive a
grade on them. They were told that the quizzes were for practice
and for their benefit. However, in contrast to the method described in
Study 3, in Study 4 the lab instructors announced quizzes on the
course syllabus so students were aware of when they were going to
take place. All quiz items were presented in SA format, thus
matching the format used on the exams. The lab instructor provided
feedback on the correct answers to each item immediately
after each quiz. As in the previous study, the quizzes were
administered in lab sections prior to Exams 1, 2, 3, 5, 7, and 8 for the
frequent exam group and three of the six quizzes were given prior
to Exam 1 for the standard group.

The final exam in the lecture sections contained 56 items. The
first 48 were SA questions, eight from each of the six subtypes
described above, that is, those resulting from a 3 (relationship to
prior exam items: exact, related, new) × 2 (knowledge requirement
to answer: fact-based, application via example) factorial
design. (As noted earlier, the last 8 items were used to explore
another hypothesis not discussed here.)

Results

The results in Figure 4 show the LS means for Testing
Frequency × Item Type. These results were produced by a
repeated-measures ANCOVA. We used orthogonal contrasts to test (a)
whether students performed better on repeated (exact and related)
than on new items, (b) whether students performed better on exact
items than on related ones, and (c) whether students performed better
on fact-based items than on application items. As in
Studies 1-3, we used an additional ANCOVA to directly test the
effect of testing frequency and quiz status on final exam score.

As in previous analyses, student aptitude, F(1, 205) = 53.53,
p < .0001, was used as a covariate. Testing frequency, F(1, 205) =
.68, p = .41, quiz status, F(1, 205) = 1.58, p = .21, and the
interaction of testing frequency and quiz status, F(1, 205) = .42,
p = .52, were not significant. Results show a significant main
effect of item type, F(5, 205) = 14.54, p < .0001, and a significant
interaction of testing frequency and item type, F(5, 205) = 3.91,
p = .002. Figure 4 shows performance on the final exam by testing
frequency and item type. The interaction of item type and quiz
status was not significant, F(5, 205) = 1.39, p = .23.

Students performed reliably better on repeated (LS M = .55)
than on new items (LS M = .52; d = .23), F(1, 205) = 15.25, p =
.0001. However, students did not perform significantly better on
exact items (LS M = .55) than on related items (LS M = .56;
d = -.08), F(1, 205) = 1.77, p = .19. They did perform better on
fact-based items (LS M = .56) than on application items (LS M =
.52; d = .31), F(1, 205) = 23.22, p < .0001.
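To illustrate what it means for the three contrasts to be orthogonal, the sketch below writes one possible set of weights over the six item subtypes and checks that each sums to zero and that every pair has a zero dot product. The weights are our own assumption for illustration (equal cell sizes, cells ordered exact-fact, exact-application, related-fact, related-application, new-fact, new-application); the article does not report the coefficients actually used.

```python
from itertools import combinations

# Hypothetical contrast weights over the six cells, in the order:
# exact/fact, exact/app, related/fact, related/app, new/fact, new/app.
CONTRASTS = {
    "repeated_vs_new": (1, 1, 1, 1, -2, -2),
    "exact_vs_related": (1, 1, -1, -1, 0, 0),
    "fact_vs_application": (1, -1, 1, -1, 1, -1),
}


def dot(u, v):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def is_orthogonal_set(contrasts):
    """True if every contrast sums to zero and all pairs are orthogonal."""
    sums_ok = all(sum(w) == 0 for w in contrasts.values())
    pairs_ok = all(
        dot(contrasts[a], contrasts[b]) == 0
        for a, b in combinations(contrasts, 2)
    )
    return sums_ok and pairs_ok


orthogonal = is_orthogonal_set(CONTRASTS)
```

With weights of this form, the three questions partition independent portions of the item-type variance, which is why they can be tested separately within one model.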

An ANCOVA on the proportion correct from the final exam
found that the overall effect due to the frequency of in-class exams
(frequent LS M = .55; standard LS M = .55; d = .04) was not
significant, F(1, 209) = 0.00, p = .98. Student aptitude was
significant, F(1, 209) = 48.74, p < .0001. Further, in this study the
effect on the final exam due to quiz status (no quiz LS M = .53,
6 quizzes LS M = .56; d = .23) did not reach statistical
significance, F(1, 209) = 2.01, p = .16, though there was a hint of an
interaction between testing frequency and quiz status, F(1, 209) =
3.02, p = .08, as can be seen in Table 3. The LS means (standard
deviations) for proportion correct on the final exam for each of the
four combinations of lecture exams (S = standard = 2; F =
frequent = 8) and number of lab quizzes (0, 6) are: S0 = .51 (.15);
S6 = .58 (.14); F0 = .55 (.16); and F6 = .54 (.13).

[Figure 4 image not reproduced; x-axis categories: Exact, Related, New.]

Figure 4. Least squares means (proportion correct) on the final exam in
Study 4 as a function of testing frequency and item type. Error bars
represent standard errors of the means.

Table 3
LS Means (SD) on Cumulative Score by Testing Frequency and
Quiz Condition for Study 4

Quiz condition    Frequent      Standard      Mean
Quiz              .54 (±.13)    .58 (±.14)    .56
No quiz           .55 (±.16)    .51 (±.15)    .53
Mean              .55           .55

Note. Table shows the least squares (LS) means for cumulative score in
each cell of the Testing Frequency × Quiz interaction in Study 4.
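The direction of the near-significant interaction can be read directly from the four cell means. The sketch below computes the quiz benefit within each class and the difference between those benefits; it is a descriptive recomputation from the rounded LS means reported above, not a reanalysis, and the names are ours.

```python
# LS means (proportion correct) for the four cells reported in the text.
cell_means = {
    ("standard", "no_quiz"): 0.51,
    ("standard", "quiz"): 0.58,
    ("frequent", "no_quiz"): 0.55,
    ("frequent", "quiz"): 0.54,
}


def quiz_benefit_for(section):
    """Quiz minus no-quiz LS mean within one lecture section."""
    return cell_means[(section, "quiz")] - cell_means[(section, "no_quiz")]


quiz_benefit = {s: quiz_benefit_for(s) for s in ("standard", "frequent")}

# Interaction contrast: how much larger the quiz benefit is in the
# standard class than in the frequent class.
interaction_contrast = quiz_benefit["standard"] - quiz_benefit["frequent"]
```

On these rounded values the quiz benefit is about .07 in the standard class and about -.01 in the frequent class, a difference of roughly .08, consistent with the pattern the text describes as a hint of an interaction.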

Discussion 

Previous  data,  specifically  the  results  of  Studies  2  and  3  (where 
the  largest  differences  between  the  frequent  and  standard  classes 
were on final exam SA items), led us to use exams composed
entirely of SA items. We expected even larger differences between
the  frequent  and  standard  classes  in  Study  4,  yet  no  difference 
between  the  two  classes  was  observed.  We  will  further  discuss  this 
finding  below.  Although  main  effects  and  interactions  of  testing 
frequency  and  quiz  status  were  not  significant,  students  in  the 
standard  class  tended  to  benefit  more  from  additional  quizzes  than 
students  in  the  frequent  class. 

Though the effect of quizzing in Study 4 was not statistically
significant, performance was in the predicted direction, contrary to
the results of Study 3. As previously mentioned, that may have
been due to slight changes in our procedure. For example, in Study
4 the quizzes were announced in class and posted on the syllabus,
but that was not the case in Study 3. It is also possible that
we observed a larger effect of quizzing in Study 4 because all
exam and quiz items were SA. Using this logic, students in the
quizzing condition may have performed better because they had
more practice answering recall items, which we know to be
challenging for students.

With respect to the testing effect, the results due to item type in
Study 4 replicated the findings in Studies 1-3 that students
perform better on repeated than on novel items. Interestingly, we also
found evidence for transfer effects in that there was no statistical
difference between performance on exact and related items. That
is, on the final exam students performed equally well on items
worded exactly as previously seen and on items that asked a
slightly different question (albeit using the same topical material)
than was asked on the midterms. Not surprisingly, students
performed better on fact-based items than on application items.


Overview  of  Results 

Here  we  will  provide  a  summary  and  overall  analysis  that 
captures  the  results  across  the  four  terms. 

We calculated two primary measures of course performance:
final exam score and cumulative score, the latter being the
cumulative performance on all exams including the midterms and final
exam. To calculate the cumulative score, each exam question
(including all questions on both the midterm exams and final
exam) was given the same weight. Thus the cumulative score
variable can be thought of as the overall percent correct throughout
the semester. Although both of these outcome measures are of some
interest, the final exam score is generally the most straightforward
one because in each semester the administration of the final exam
was identical for all students, whereas the cumulative score
includes data from the midterms, where frequency of exam
administration and length of the exams differed between the two classes,
though as noted earlier the midterm items were identical across the
classes in each semester. In addition, and importantly, the average
time between presentation of the material and testing it on a
midterm is shorter in the frequent than in the standard classes
(though this was not true for the final exams).
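Because every question carries the same weight, the cumulative score is the total number correct divided by the total number of items, which is not the same as averaging each exam's percentage. The sketch below contrasts the two. The student's scores and the split into exams are invented for illustration; only the 112-item total corresponds to the cumulative item count reported for Studies 2 and 3.

```python
def cumulative_score(exams):
    """Item-weighted score: exams is a list of (n_correct, n_items)."""
    total_correct = sum(correct for correct, _ in exams)
    total_items = sum(n_items for _, n_items in exams)
    return total_correct / total_items


def mean_of_exam_proportions(exams):
    """Exam-weighted alternative: average of per-exam proportions."""
    return sum(correct / n_items for correct, n_items in exams) / len(exams)


# Hypothetical student: two 24-item midterms and a 64-item final.
hypothetical_student = [(18, 24), (12, 24), (48, 64)]

item_weighted = cumulative_score(hypothetical_student)
exam_weighted = mean_of_exam_proportions(hypothetical_student)
```

For this made-up student the item-weighted score (78 of 112) is higher than the exam-weighted average, because the longer final exam counts for more when items rather than exams are the unit.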

The  results  presented  here  come  from  a  total  of  588  students  for 
whom  we  had  student  aptitude  scores  and  who  completed  the 
course.  We  omitted  students  who  dropped  the  course,  plus  an 
additional  31  students  who  did  not  have  an  aptitude  score.  The 
latter  were  recent  transfer  students  who  had  not  made  a  GPA 
record  at  UH  and  also  had  no  standard  test  scores  in  the  UH 
records.  (Though  not  reported  here,  an  analysis  that  included  those 
31  students — where  we  imputed  an  average  student  aptitude  score 
relative  to  other  students  in  the  study — showed  slightly  greater 
differences  on  the  effects  of  interest.) 

Effects  of  Prior  Testing  on  Subsequent 
Test  Performance 

A repeated measures ANCOVA (comparable to those reported
above) was used to summarize the results of Studies 1-3 and
examine the set of key questions motivating this work. These
questions were (a) whether on the final exam students perform better on
items that have been seen on previous exams, (b) whether
performance on repeated items depends on whether the item form (MC
or SA) was consistent across exams, (c) whether the effect of
testing frequency depends on item type, and (d) whether the
low-stakes quizzes affect final exam results. In addressing these
issues, data from Study 4 were analyzed separately from Studies
1-3 because of differences in the study design, primarily the fact
that exams in Study 4 were composed entirely of SA items.

As  done  in  previous  analyses,  a  proportion  correct  score  for  six 
types  of  items  (MC-MC,  MC-SA,  SA-MC,  SA-SA,  new  MC,  and 
new  SA)  was  calculated  for  each  student.  These  outcome  measures 
were  regressed  on  predictors:  testing  frequency,  quiz  status,  item 
type,  student  aptitude,  and  semester.  We  used  specific  contrasts  to 
test  hypotheses  (a)  and  (b),  and  fixed  effects  results  from  the  mixed 
model  to  test  hypotheses  (c)  and  (d). 

Results revealed significant main effects of the covariates
student aptitude, F(1, 328) = 112.56, p < .0001, and semester, F(2,
328) = 3.56, p = .03. Item type, F(5, 328) = 75.48, p < .0001,
and the interaction of testing frequency and item type, F(5, 328) =
4.34, p = .0008, were also significant (Table 4). The interaction of
testing frequency and quiz status, F(1, 328) = .93, p = .34, was
not significant; however, the interaction between quiz status and
item type, F(5, 328) = 6.88, p < .0001, was significant.

We used orthogonal contrasts to test the additional hypotheses
that:

(a) Students perform better on final exam items that previously
appeared on a midterm exam (LS M = .71) than on new items (LS
M = .65; d = .32). This repeated versus new item contrast was
significant, F(1, 328) = 95.12, p < .0001.

(b) Superior performance would be observed for items that were
repeated in the same format, for example, MC-MC and SA-SA (LS
M = .69) as opposed to being flipped, for example, SA-MC,
MC-SA (LS M = .73; d = -.21). That difference was also
significant, F(1, 328) = 24.42, p < .0001, though on average the
flipped items performed better than the identical ones, contrary to
expectations.

Table 4
LS Means on Final Exam (Proportion Correct) for Testing
Frequency by Item Type, Studies 1-3

                           MC items                  SA items
Test frequency    n     MC-MC  SA-MC  New MC     SA-SA  MC-SA  New SA
Frequent         193     .76    .78    .71        .65    .70    .60
Standard         142     .76    .80    .68        .59    .64    .57
Total            335     .76    .79    .70        .62    .67    .59

Note. The ns do not include the 214 students from Study 4 and 39
students from Study 1 who have missing data; MC-MC, SA-MC, and New
MC were all multiple-choice (MC) items at the final exam; SA-SA,
MC-SA, and New SA were all short-answer (SA) items at the final exam.
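The contrast means reported for hypotheses (a) and (b) can be approximately recovered by averaging the Total row of Table 4 over the relevant item types. The sketch below does this with simple unweighted means. It is an informal consistency check under our own assumption of equal weighting; the published LS means come from the fitted model, so agreement is expected only to rounding.

```python
# Total-row LS means from Table 4 (Studies 1-3).
TABLE4_TOTAL = {
    "MC-MC": 0.76, "SA-MC": 0.79, "New MC": 0.70,
    "SA-SA": 0.62, "MC-SA": 0.67, "New SA": 0.59,
}


def mean_over(item_types):
    """Unweighted mean of the Table 4 totals for the given item types."""
    return sum(TABLE4_TOTAL[t] for t in item_types) / len(item_types)


repeated_mean = mean_over(["MC-MC", "SA-MC", "SA-SA", "MC-SA"])
new_mean = mean_over(["New MC", "New SA"])
same_format_mean = mean_over(["MC-MC", "SA-SA"])
flipped_mean = mean_over(["SA-MC", "MC-SA"])
```

The averages come out near .71 (repeated), .645 (new), .69 (same format), and .73 (flipped), which lines up with the reported values of .71, .65, .69, and .73.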

To examine hypothesis (c), whether the effect of testing
frequency depends on item type, we used results from the same
repeated measures model described above to evaluate the main effects of
testing frequency and item type and their interaction. Table 4
shows the LS means (proportion correct) on the final exams from
Studies 1-3. More specifically, this table shows the mean
proportion of items correct associated with each cell in the combination
of testing frequency (frequent vs. standard) and item type (6 types
of items); and the model tests the main effects of these two
variables and their interaction. Results from this model show a
significant main effect of item type, F(5, 328) = 75.48, p < .0001,
and a significant interaction between testing frequency and item
type, F(5, 328) = 4.34, p = .0008. The main effect of testing
frequency, F(1, 328) = 3.11, p = .08, approached significance.

To examine hypothesis (d), whether the low-stakes quizzes
affected final exam results, we used the fixed effects results from
the same mixed linear model as was used to test hypotheses
(a)-(c). The results show that while overall the quiz condition, F(1,
328) = .66, p = .42, failed to predict final exam performance, we
did observe a significant interaction between quiz status and item
type, F(5, 328) = 6.88, p < .0001. Table 5 shows the results (LS
means) for the interaction of quiz and item type in Studies 1-3.
Though not directly related to any specific hypothesis in this
analysis, we observed that the effect of testing frequency did not
depend on the level of the quiz variable, F(1, 328) = .93, p = .34.

Frequency  of  Testing 

To address questions about frequency of testing and the testing
effect (do students perform better on previously tested items?), and
the effect of low-stakes quizzes (do students perform better when
given such quizzes?), we provide results from an ANCOVA (SAS
9.4) with four factors: student aptitude, testing frequency, quiz
status, and semester. The semester variable was used to control for
differences in the administration of each study, and of course the
participants differed across semesters. We first fit a full model that
includes all four factors mentioned above, and then we examine a
reduced model that omits some nonsignificant factors to evaluate
the effect of testing frequency in the most parsimonious context.

Table 5
LS Means on Final Exam (Proportion Correct) for Quiz by Item
Type, Studies 1-3

                           MC items                  SA items
Quiz condition    n     MC-MC  SA-MC  New MC     SA-SA  MC-SA  New SA
Quiz              91     .77    .80    .68        .61    .70    .62
No quiz          244a    .75    .78    .72        .63    .64    .55
Total            335     .76    .79    .70        .62    .67    .59

a Includes participants from Studies 1 and 2, none of whom were given
quizzes; MC-MC, SA-MC, and New MC were all multiple-choice (MC)
items at the final exam; SA-SA, MC-SA, and New SA were all
short-answer (SA) items at the final exam.

Final  Exam  Performance 

Full model. Naturally, there was a large effect for student
aptitude on final exam performance, F(1, 577) = 218.34, p <
.0001, and also a significant main effect for semester, F(3, 577) =
32.50, p < .0001. Table 6 shows both the raw means (proportion
correct) for each condition and the LS mean for testing frequency
in this model. (There were 64 final exam items in Studies 1-3 and
48 final exam items in Study 4.) The results from the full model on
the final exam score indicate that testing frequency, F(1, 577) =
2.23, p = .14, was not significant. The interaction of semester and
testing frequency was not significant, F(3, 577) = 1.78, p = .15.
The main effect of quiz status was not significant (quiz LS M =
.66; no-quiz LS M = .63), F(1, 577) = 1.65, p = .19, nor was the
interaction between testing frequency and quiz status, F(1, 577) =
.96, p = .33.

Reduced model. Evidence from the Type I sums of squares in
the full model on final exam score suggested that testing frequency
exhibits a significant effect when controlling for student aptitude.
For this reason, a reduced model was used to further examine the
effect of testing frequency, while omitting nonsignificant factors
(quiz status and the interaction of testing frequency and quiz
status). This reduced model includes the factors student aptitude,
testing frequency, and semester, and the interaction of testing
frequency and semester.

In the reduced model, student aptitude, F(1, 579) = 219.62, p <
.0001, and semester, F(3, 579) = 5.29, p < .0001, again show strong
effects. We also observed a significant effect of testing frequency,
F(1, 579) = 5.29, p = .02. The frequent class outperformed the
standard class on the final exam in Studies 1-4, where we observed
effect sizes (Cohen’s d) of .07, .20, .28, and .04, respectively.
Figure 5 shows the LS means for final exam score by condition in
Studies 1-4. The interaction of semester and testing frequency,
F(3, 579) = 1.37, p = .25, was not significant.

Cumulative  Exam  Performance 

Full model. When the outcome variable cumulative score
(also shown in Table 6) is used, a parallel (full model) analysis
again found that both student aptitude, F(1, 577) = 277.92, p <
.0001, and semester, F(3, 577) = 14.75, p < .0001, had large
effects. (Cumulatively, there were 104 test items in Studies
1 and 4 and 112 in Studies 2 and 3.) For these data, even in the
full model we observed a significant main effect of testing
frequency, F(1, 577) = 6.83, p = .01. The interaction between
testing frequency and semester was not significant, F(3, 577) =
.77, p = .38. The difference in LS means for students in the quiz
condition (.64) and the no-quiz condition (.63) was not
statistically significant (d = .05), F(1, 577) = .77, p = .38, nor was
the interaction of testing frequency and quiz status, F(1, 577) =
1.31, p = .35.

Reduced model. The reduced model on cumulative score
(analogous to the reduced model on final exam performance)
showed comparable results to the full model for the main effects of
student aptitude and semester. However, the main effect of testing
frequency, F(1, 579) = 13.15, p = .0003, was significantly
stronger in the reduced model. As observed in the full model, the
frequent class consistently outperformed the standard class in
Studies 1-4, where we estimate effect sizes (d) of .15, .43, .31, and
.20, respectively (see Table 6). Figure 6 shows the LS means for
cumulative score by condition in Studies 1-4. The interaction of
testing frequency and semester was not significant, F(3, 579) =
.76, p = .52.

Table 6
Descriptive Statistics for Studies 1-4

                                        Final exam                 Cumulative score
Experiment  Condition              n    Mean  LS mean  SD    d     Mean  LS mean  SD    d
            (number of exams)
Study 1     Frequent (4)          84    .66   .66      .15   .07   .67   .67      .13   .15
            Standard (2)          75    .64   .65                  .64   .65
Study 2     Frequent (8)          34    .73   .71      .15   .20   .71   .68      .14   .43
            Standard (2)          36    .66   .68                  .63   .62
Study 3     Frequent (8)          75    .69   .68      .18   .28   .68   .67      .16   .31
            Standard (2)          70    .62   .63                  .61   .62
Study 4     Frequent (8)          99    .55   .55      .13   .04   .56   .59      .15   .20
            Standard (2)         115    .53   .55                  .52   .56
Total       Frequent             292    .64   .65      .19   .16   .64   .65      .17   .24
            Standard             296    .59   .62                  .58   .61

Note. d = Cohen’s d on LS means. Final exam mean is the raw mean score
(proportion correct) on the final exam; final exam least squares (LS)
mean is the mean score on the final exam over a balanced population
(e.g., controlling for differences in student aptitude); LS means
estimates for both final exam and cumulative score are taken from the
respective reduced models; cumulative score mean is the mean total
score on all midterm exams and the final exam; Cohen’s d was calculated
by dividing the difference between the frequent and standard class LS
mean by the pooled within-group standard deviation; the effect sizes
for each study and the overall effect size for each measure are
reported.

Figure 5. Least squares means (proportion correct) on the final exam in
Studies 1-4 as a function of testing frequency. Error bars represent
standard errors of the means.
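The effect sizes for cumulative score can be approximately reproduced from the entries in Table 6 by applying the definition given in the table note: the difference between the frequent and standard LS means divided by the pooled within-group standard deviation. The sketch below does so for Studies 1-4. Because the table values are rounded to two decimals, recomputed values should be expected to match the published ones only approximately; the function and variable names are ours.

```python
def cohens_d(ls_mean_frequent, ls_mean_standard, pooled_sd):
    """Cohen's d as described in the Table 6 note."""
    return (ls_mean_frequent - ls_mean_standard) / pooled_sd


# Cumulative score entries from Table 6:
# (frequent LS mean, standard LS mean, pooled SD).
CUMULATIVE = {
    "Study 1": (0.67, 0.65, 0.13),
    "Study 2": (0.68, 0.62, 0.14),
    "Study 3": (0.67, 0.62, 0.16),
    "Study 4": (0.59, 0.56, 0.15),
}

recomputed_d = {
    study: round(cohens_d(*values), 2) for study, values in CUMULATIVE.items()
}
```

The recomputed values of about .15, .43, .31, and .20 agree with the effect sizes reported for cumulative score. The same calculation on the final exam columns gives a value near zero for Study 4 rather than the reported .04, presumably because both LS means round to .55.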


General  Discussion 

While  the  methodology  used  in  these  four  studies  differed 
somewhat  from  semester  to  semester,  there  were  fundamental 
conceptual  similarities  among  the  manipulations  we  carried  out 
and  considerable,  though  not  total,  consistency  in  the  results  as  can 
be  seen  by  again  inspecting  Tables  4  and  6. 

We  can  summarize  the  overall  results  as  follows: 

1.  More  frequent  testing  generally,  though  not  invariably, 
led  to  better  final  exam  and  overall  course  results. 

This  replicates  a  phenomenon  that  has  been  observed  many 
times  over  the  years,  though  not  often  in  college  classes  over  an 
entire  semester,  and  rarely  if  ever  before  replicated  in  the  fashion 
of  this  work  which  allows  us  to  see  a  range  of  effect  sizes  along 
with  variation  of  procedures  within  a  common  framework.  In 
addition  to  the  direct  evidence,  there  is  additional  confirmation  of 
the  benefits  due  to  frequent  testing  embedded  in  the  results  of  the 
various methods used in our experiments. For example, we
observed a larger effect size of frequent testing on the final exam in
Study 2 (d = .20) than in Study 1 (d = .07). The main difference
between  these  two  studies  was  the  increase  in  number  of  exams  in 
the  frequent  class  from  four  in  Study  1  to  eight  in  Study  2.  Though 
we  did  not  directly  test  the  effect  of  four  versus  eight  exams  in  any 
one  study,  this  work  provides  evidence  that  (at  least  up  to  a  point) 
more  frequent  testing  improves  performance  on  the  final  exam  in 
a  full  semester  course. 

In Studies 2 and 3 the increase in performance of the frequent
class over the standard class was not only reliable by usual
standards, but was of meaningful size: 12%-17% on the total
cumulative scores. In practice, that was an average improvement of
better than half a letter grade in the current courses. To give a sense
of the practical benefit of such improved performance, depending
upon the size of the class and, of course, on the grading scheme
adopted, it could allow a consequential number of additional
students to earn passing grades. That is a significant (in both
senses) improvement at very little additional cost and effort. Note
that the resources required to grade the exams, including time to do
so, are quite similar across the two conditions. Thus, giving more
frequent exams (e.g., 8 per semester) would be an inexpensive and
easily adoptable and adaptable modification to many college
courses.

Though we report a relatively consistent trend for the frequent
class to outperform the standard class over four semesters, when
we analyzed performance on the different item types we observed
a significant interaction between testing frequency and item type in
Studies 1-3. The data in Table 4 show that the frequent and
standard classes did not differ on MC final exam items, but that the
frequent class consistently outperformed the standard class on final
exam SA items. This finding suggests that frequent testing may
have large benefits for items that require recall, and a smaller
effect on recognition items. However, that conclusion is
considerably tempered by the results of Study 4, where the frequent class
did not outperform the standard class on the final exam, and where
the test items were in SA format. Perhaps the fact that the items
were substantially more difficult than in the earlier studies (about
15% fewer correct answers, on average) played a role in this
finding. With fewer correct answers, participants may have had
less internal “good advice” to draw upon when attempting to
answer final exam items. That is, less was learned from the exams
themselves. This discrepancy in the overall findings for the
frequency effect requires further unpacking in the future.

Furthermore, we also observed an interaction between quiz and
item type. Students in the quiz condition showed superior
performance on final exam SA items (see Table 5). As we defined them,
testing frequency and quiz are primarily distinct in that testing
frequency manipulates the distribution of exams (with the total
number of exam items held constant), and quizzing provides
additional retrieval practice. Thus students in the quiz condition
(regardless of whether in the frequent or standard class) received
the opportunity to answer 36 additional questions over the course
of the semester. This additional retrieval practice may be
responsible for improved performance on SA (recall) items.

Figure 6. Least squares means (proportion correct) on cumulative score
in Studies 1-4 as a function of testing frequency. Error bars represent
standard errors of the means.

2.  The  repetition  or  testing  effect  is  generally  present  and 
meaningful  in  these  ecologically  valid,  semester-long 
studies. 

In  all  four  studies  we  found  that  students  perform  better  on 
repeated  items  (those  seen  previously  on  midterm  exams)  than  on 
items  that  had  not  been  previously  tested.  This  finding  remained 
highly  reliable  even  when  in  Study  4  we  included  related  items  in 
the  repeated  category.  This  finding  replicates  several  findings  that 
show  testing  or  retrieval  effects  in  practical  educational  settings. 
Our  studies  differ  from  most  other  research  on  the  testing  effect  in 
that  they  each  took  place  over  an  entire  semester.  These  findings 
suggest  that  testing  on  material  (even  once)  and  receiving  feedback 
can  improve  long-term  retention. 

3.  The  testing  effect  generally  held,  even  when  the  format 
of  the  item  changed  (e.g.,  from  MC  on  the  first  test  to  SA 
on  the  final,  and  vice  versa). 

The  percentage  of  questions  answered  correctly  on  the  final 
exam  increased  by  about  8%  for  MC  items,  and  about  5%  for  SA 
items  in  the  testing  effect  conditions — that  is,  when  the  items  had 
appeared  on  earlier  exams.  Again,  these  are  meaningful  increases 
in  performance  that  make  a  difference  in  the  grade  distributions  in 
large  class  sections. 

It  is  also  notable  that  participants  in  both  the  frequent  and  the 
standard  groups  improved  when  presented  with  a  repeated  item 
whether given in identical or flipped format. This finding is consistent
with data from Butler (2010) and replicates observations
made  by  Bjork  et  al.  (2014),  and  by  McDermott  et  al.  (2014).  Put 
another  way,  the  testing  effect  transferred  to  items  presented  in  an 
alternate  format.  Indeed,  our  students  generally  did  somewhat 
better  on  the  flipped  items  than  on  the  identical  ones.  We  consider 
that a surprising and likely important finding, one also remarked
upon by McDermott et al. (2014), though, as noted in the introduction,
not everyone has observed it (e.g., Wooldridge et al.,
2014).  If  it  does  exist,  how  would  we  account  for  the  apparent 
superiority  (or  even  equality)  of  flipped  and  related  questions 
compared  with  identical  ones  asked  on  the  final  exam? 

Bjork  et  al.  (2014)  “found  that  test-takers  ability  to  answer 
related  questions  on  a  delayed  test  was  only  enhanced  when  the 
correct  answers  to  such  questions  had  been  plausible  incorrect 
alternative  on  the  previous  test  ...”  (p.  169).  They  suggest  that 
students  who  process  those  alternative  answers  learn  distinctions 
that can help them answer subsequent, related questions. An inspection
of Table 4 suggests some support for that idea: namely, an
increased  probability  of  getting  an  SA  item  correct  on  the  final 
exam  when  preceded  by  a  flipped  MC  item  (.67)  than  when 
preceded  by  the  same  SA  item  (.62).  However,  there  was  no 
difference  in  probability  of  getting  an  MC  item  correct  on  the  final 
exam  whether  preceded  by  an  MC  (.76)  or  an  SA  (.79)  item. 
Admittedly,  we  may  be  observing  a  ceiling  effect  here;  and  our 
items  do  not  really  parallel  those  used  by  Bjork  et  al.  (2014). 

Stepping  back,  there  is  an  extensive  literature  on  transfer  of 
training,  one  of  the  oldest  topics  in  experimental  psychology  (e.g., 
Barnett & Ceci, 2002; Detterman & Sternberg, 1993; Harrison,
Shipstead, & Engle, 2015; Mayer & Wittrock, 1996; Singley &
Anderson, 1989; Taatgen, 2013). Almost any explanation of transfer
would predict that an identical test item should more readily be
associated  with  the  previously  presented  one  than  would  a  changed 
test  item.  All  the  surface  cues  are  the  same  in  the  identical  case. 
However,  in  our  studies  that  result  appears  not  to  hold.  What’s 
different?  For  one  thing,  in  this  work  a  great  deal  of  time — up  to 
3  months — could  pass  between  the  original  test  and  the  final  one. 
In  addition,  the  intervening  time  is  not  just  filled  with  normal 
everyday  life  going  by.  Importantly,  we  think,  additional  related 
and  relevant  material  is  presented  to  the  participants  during  that 
time  and  they  are  building  an  extensive  and  interrelated  knowledge 
base.  Too,  they  attempt  to  retrieve  or  recall  information  from  that 
knowledge  base  on  subsequent  exams  and  on  other  occasions  (e.g., 
working  homework  problems  on  basic  statistical  concepts).  Under 
those circumstances, maximally effective interrogation of the complex
knowledge representation may not require pattern matching
with  the  original  form  of  the  question. 

To  make  an  accurate  prediction  of  transfer,  we  may  need  to 
consider  both  the  length  of  time  since  initial  presentation  and, 
importantly,  the  structure  of  the  resulting  representation — one 
constructed  from  the  intervening  material  as  well  as  from  the 
material  tested  on  the  early  exams.  Thus,  there  may  be  multiple 
effective  cues  that  can  access  prior  information,  now  a  component 
of  a  more  complex  representational  system.  In  addition,  we  need  a 
measure  of  how  well  the  new  question  overlaps  with  that  changing 
representation.  Thus,  what  appears  as  a  “far”  transfer  item  on  the 
surface  may  in  fact  be  a  “near”  one  when  we  compare  it  to  the 
latent  representation  of  what  the  student  knows  (see,  e.g.,  Taatgen, 
2013,  for  a  discussion  of  far  transfer  in  the  skill  domain). 

In  sum,  should  this  finding  hold  up,  it  again  points  to  the  need 
for  better  understanding  of  transfer,  especially  with  long  time 
delays  partially  filled  with  complex,  interrelated  materials.  After 
all,  our  goal  as  educators  is  to  present  and  test  information  such 
that  it  leads  to  a  greater  likelihood  of  being  available  to  aid  both 
question  answering  and  problem  solving  at  much  later  times. 

4.  We  did  not  observe  an  overall  benefit  from  taking  the 
low-stakes  quizzes. 

The  failure  to  see  benefits  of  low-stakes  quizzes  could  be  due  to 
several factors. Our results showed no difference (or even a negative
effect) between performance in the quiz and no-quiz condition
in Study 3. However, in Study 4 we showed a slight advantage
to taking quizzes. As previously mentioned, advance announcement
of the quizzes and the testing format used in Study 4 could
have influenced the results. Because no grade depended on individual
quiz performance there may also be motivational factors
that modulate the effect of low-stakes quizzes. An extensive and
wide-ranging body of research on motivation and performance shows
evidence for individual differences in the origin of motivation to
perform in educational tasks (e.g., Wolters, Denton, York, &
Francis, 2014). Studies in the motivation literature also show how
incentives can play a large role in performance (e.g., Duckworth,
Quinn, Lynam, Loeber, & Stouthamer-Loeber, 2011). If motivation
and incentives can modulate performance on exams, it is
intuitive  that  these  factors  play  a  role  in  the  effectiveness  of 
educational  training  conditions  such  as  those  employed  in  the 
present work. So here, though not only here, we endorse the
recommendation that further research is necessary for both theoretical
reasons and to explore the limits of practical applications of
this work.

One  methodological  point  is  worth  revisiting:  The  inability  to 
randomly assign participants to conditions poses a potential limitation
to our conclusions. We dealt with this issue by obtaining
covariates  for  student  aptitude  (GPA,  SAT,  and  ACT  scores)  and 
including  them  in  our  analyses,  and  by  alternating  the  time  (10:00 
a.m.  or  11:30  a.m.)  of  the  testing  frequency  treatment  across 
semesters. Furthermore, we acknowledge the limitation that testing
frequency systematically varied with the classroom; that is, effects
of classroom clustering (classroom environment) could have
influenced  our  results.  Given  that  we  used  one  instructor,  we  are 
unable  to  estimate  effects  due  to  class  cluster. 

In  summary,  the  findings  in  Studies  1-4  suggest  that  frequent 
testing and making use of the testing effect can improve performance
on a comprehensive final exam (and the cumulative course
performance)  in  “live”  classroom  settings  over  an  entire  semester. 
It  appears  that  typically,  but  not  inevitably  (perhaps  when  item 
difficulty  was  very  high),  the  effect  of  frequent  testing  is  greatest 
on difficult (e.g., SA) items that require recall rather than recognition.
For the most part, these appear to be inexpensive and
readily  scalable  findings.  We  also  found  that  the  benefit  of  prior 
testing was not limited to items that are exactly repeated, but
appeared to generalize to new questions that required the same
information to answer. In addition, we observed
small  to  nonexistent  effects  of  no-stakes  quizzes,  but  certainly  feel 
that  this  matter  requires  further  exploration  for  the  reasons  noted 
above. 

Finally,  we  acknowledge  that  there  is  much  left  to  do  before  we 
understand  the  differences  in  effect  sizes  that  we  and  others  have 
observed,  and  to  explore  further  the  transfer  of  training  issues  that 
arise  both  in  laboratory  studies  and  larger  scale  investigations  such 
as reported here.

Appendix

Example  of  Related  Items  From  Study  4 

Fact-Based  (Definition)  Items 

Midterm. If another variable systematically co-varies with the independent variable, then we likely have a _.

Final. A confound likely exists when an extraneous variable systematically co-varies with the _.

Midterm. Inferential statistics allow us to draw conclusions about the _ on the basis of data from a _.

Final. When we draw a conclusion about a population on the basis of a sample, we are using _ statistics.

Application  Items 

Midterm: Mr. Thinblood was born in Houston and moved to New York. He found himself feeling tired in the winter, more tired than
he remembered being in Houston during the winter. He wondered whether cold weather makes people tired. How would we best ask this
question  scientifically  (in  standard  form)? 


Final:  Mr.  C.  D.  Cloud  believes  that  whenever  he  washes  his  car  it  greatly  increases  the  chances  of  rain.  If  he  turned  his  belief  into  a 
scientific  question,  how  would  he  express  it  (in  standard  form)? 


Midterm:  A  person  is  at  the  84th  percentile  on  a  standardized  test.  Given  that  the  sample  mean  is  60  and  SD  =  9,  what  is  this  person’s 
raw  score? 


Final: On an exam the mean is 50 and the SD = 5. You ended up at the 16th percentile on this exam. What score did you make?

Many cognitive self-regulation (CSR) measures are related to the academic achievement of prekindergarten
children and are thus of potential interest for school readiness screening and as outcome variables
in  intervention  research  aimed  at  improving  those  skills  in  order  to  facilitate  learning.  The  objective  of 
this  study  was  to  identify  learning-related  CSR  measures  especially  suitable  for  such  purposes  by 
comparing  the  performance  of  promising  candidates  on  criteria  designed  to  assess  their  educational 
relevance  for  pre-K  settings.  A  diverse  set  of  12  easily  administered  measures  was  selected  from  among 
those  represented  in  research  on  attention,  effortful  control,  and  executive  function,  and  applied  to  a  large 
sample  of  pre-K  children.  Those  measures  were  then  compared  on  their  ability  to  predict  achievement 
and  achievement  gain,  responsiveness  to  developmental  change,  and  concurrence  with  teacher  ratings  of 
CSR-related  classroom  behavior.  Four  measures  performed  well  on  all  those  criteria:  Peg  Tapping, 
Head-Toes-Knees-Shoulders, the Kansas Reflection-Impulsivity Scale for Preschoolers, and Copy Design.
Two others, Dimensional Change Card Sort and Backwards Digit Span, performed well on most of
the  criteria.  Cross-validation  with  a  new  sample  of  children  confirmed  the  initial  evaluation  of  these 
measures  and  provided  estimates  of  test-retest  reliability. 


Educational  Impact  and  Implications  Statement 

The ability of prekindergarten children to regulate such cognitive functions as attention and task
persistence is related to their learning and academic achievement. This study identified measures of
such learning-related cognitive self-regulation especially suitable for screening pre-K children for
school  readiness  and  as  outcome  measures  for  interventions  aimed  at  improving  those  skills. 


Keywords:  cognitive  self-regulation,  executive  function,  school  readiness,  measurement 


The ability of young children to exert control over their cognition
and behaviors within educational contexts has been variously
labeled approaches to learning (Davoudzadeh, McTernan, &
Grimm, 2015; Zimmerman, 1990), learning dispositions (Katz,
1993, 2002), and work-related skills (Cooper & Farran, 1988,
1991; Schmitt, Pratt, & McClelland, 2014). However labeled,
ample research has demonstrated that children’s ability to focus on
classroom tasks, persist despite difficulty, and engage in learning
activities is positively related to academic achievement (Duncan
et al., 2007; Li-Grining, Votruba-Drzal, Maldonado-Carreno, &
Haas, 2010; McClelland, Morrison, & Holmes, 2000; Morgan,
Farkas, & Wu, 2011). The constellation of skills that support this
behavior can be referred to broadly as cognitive self-regulation.

Research  on  cognitive  self-regulation  (CSR)  has  been  conducted 
within various conceptual frameworks including attention, executive
function, and effortful control. Attentional functions such as
conscious detection and sustained focus on a target stimulus are
foundational aspects of one’s ability to control thoughts and behaviors
(Posner & Rothbart, 2000; Rothbart & Ahadi, 1994).
Executive function, in turn, is generally defined as a set of cognitive
abilities that aid in the completion of goal-directed actions
(Hughes & Ensor, 2011; Miyake et al., 2000). These abilities
include adapting or shifting actions to changing situational demands
(Zelazo, Frye, & Rapus, 1996), active maintenance and
manipulation of information in working memory (Baddeley &
Hitch, 1974), and inhibition of inappropriate but prepotent responses
(Diamond, 1990). Related to inhibitory control is the
construct  of  effortful  control,  which  involves  volitional  behavioral 
regulation  related  to  aspects  of  temperament  (Kochanska,  Murray, 
&  Harlan,  2000). 

A  number  of  assessments  of  CSR-related  constructs  suitable  for 
administration  directly  to  pre-K  age  children  have  been  developed 
within  these  research  contexts,  and  many  of  them  have  been  shown  to 
be  related  to  concurrent  or  future  academic  achievement  (Allan  & 
Lonigan, 2011; Blair & Razza, 2007; Gathercole, Brown, & Pickering,
2003; Jacob & Parkinson, 2015; Lan, Legare, Ponitz, Li, &
Morrison, 2011) and achievement gains during the pre-K and kindergarten
years (Fuhs, Nesbitt, Farran, & Dong, 2014; Matthews, Ponitz,
& Morrison, 2009; McClelland et al., 2007; Ponitz, McClelland,
Matthews, & Morrison, 2009; Welsh, Nix, Blair, Bierman, & Nelson,
2010). Indeed, evidence indicates that cognitive self-regulation measures
are among the strongest predictors of achievement after prior
measures  of  achievement  itself  (Duncan  et  al.,  2007). 

Aside from whatever theoretical insights derive from this research,
the relations of such measures to the academic achievement
of pre-K children have particular importance in educational contexts.
Most immediately, the measures most strongly related to
achievement might be used in assessments of school readiness to
identify children whose CSR skills may not be sufficient to support
effective engagement in the learning opportunities in kindergarten
and beyond and who thus need help enhancing those skills. Those
measures, in turn, would be appropriate targets as outcome variables
in intervention research aimed at finding ways those skills
can be improved to better facilitate learning in classroom environments.

But,  which  of  the  many  measures  of  CSR  that  can  be  used  with 
pre-K  aged  children  are  especially  suitable  for  these  purposes? 
Most prior studies reporting relations with achievement have focused
on only a few measures and typically did not have comparison
of those relations as their main objective. Moreover, different
studies have used different achievement measures and different
samples, features that could themselves influence the magnitude of
the relations, thus making it difficult to compare the performance
of measures used in different studies. And no study has systematically
assessed CSR measures with regard to the multiple attributes
that would make them most educationally relevant to pre-K students.

The  purpose  of  the  study  reported  here  is  to  make  just  such  a 
comparative assessment for a group of candidate measures selected
to represent a range of CSR skills while also being easily
administered  to  young  children.  The  aim  of  this  assessment  is  to 
identify  CSR  measures  with  clear  educational  relevance  for  pre-K 
children;  that  is,  measures  that  perform  especially  well  when  the 
interplay  between  CSR  and  achievement  is  of  interest  in  pre-K 
classroom  settings.  The  results,  in  turn,  are  intended  to  provide 
guidance  to  pre-K  researchers  and  practitioners  seeking  measures 
of  CSR  for  screening  or  research  applications  that  have  sound 
measurement  properties  and  demonstrated  relations  to  learning  and 
CSR-related  classroom  behavior. 

Criteria  for  Evaluating  CSR  Measures 

A  comparative  assessment  focused  on  the  educational  relevance 
of  CSR  measures  for  pre-K  students  first  requires  decisions  about 
the  basis  for  selecting  candidate  measures  and  specification  of 
appropriate criteria with which to assess them. To identify promising
candidate measures, we did not attempt to apply strict selection
criteria but used the informed judgment of our research team
to  pick  measures  that  represented  a  range  of  CSR  skills  and  tasks 
(described  in  more  detail  below)  and  to  favor  measures  more 
widely  known  and  used  in  early  childhood  research.  Further,  with 
practical  application  and  broad  utility  in  mind,  we  considered  only 
measures  that  could  be  easily  administered  in  school  settings  by 
school  personnel  or  researchers  with  limited  resources;  that  is, 
those  that  could  be  completed  in  a  relatively  brief  period  without 
specialized equipment or Internet connections. A similar
assessment of computer-based CSR measures would be informative,
but for this study we chose to focus on readily accessible
measures  so  the  findings  would  be  as  broadly  useful  as  possible. 

To assess the relative performance of the selected CSR measures,
we identified a set of attributes we judged to be indicative of
their  educational  relevance  for  use  in  pre-K  contexts.  The  most 
important  of  these,  of  course,  involved  the  relation  of  the  measures 
to academic achievement. Three types of relations were differentiated.
First, we examined correlations between the CSR measures
administered at the beginning of the pre-K year and later achievement.
With our focus on CSR skills related to learning, the most
educationally  relevant  measures  are  those  most  directly  predictive 
of  achievement.  Less  predictive  measures,  by  definition,  are  less 
closely  associated  with  whatever  influence  CSR  skills  have  on 
achievement. 

Second,  we  compared  the  candidate  measures  on  their  ability  to 
predict  the  gains  in  achievement  made  during  subsequent  periods. 
Children with better initial CSR skills may show higher subsequent
achievement, but that does not necessarily mean they also
gain  more  during  that  period.  They  are  likely  to  have  higher 
achievement  levels  to  begin  with  and  may  simply  maintain  their 
relative  position.  If  we  expect  children  with  better  CSR  skills  to  be 
better  able  to  engage  in  the  learning  opportunities  presented  in 
pre-K  classrooms,  they  should  show  greater  gains  in  achievement 
over  the  pre-K  year  and,  similarly,  over  later  school  years.  The 
most  educationally  relevant  CSR  measures,  therefore,  should  be 
those  that  show  the  strongest  relations  to  subsequent  achievement 
gains.

Third, we compared the candidate CSR measures on an even
more  specific  kind  of  relation  with  achievement.  Pre-K  experience 
is  not  only  expected  to  affect  achievement  but  may  also  affect  CSR 
skills  themselves  such  that  those  skills  will  improve  during  the 
course  of  the  school  year.  Indeed,  learning  to  pay  attention,  stay  on 
task, change tasks when asked, and other such CSR-related behaviors
are part of the school readiness objectives of many pre-K
programs.  If  CSR  skills  are  related  to  gains  in  achievement,  then 
gains  in  CSR  skills  should,  in  turn,  be  related  to  further  gains  in 
achievement.  We  therefore  compared  the  candidate  measures  on 
the  extent  to  which  the  CSR  gains  observed  over  the  pre-K  year 
were  related  to  achievement  gains.  Those  relations  are  especially 
informative  about  the  potential  of  the  different  CSR  measures  as 
outcomes  for  research  on  pre-K  interventions  aimed  at  enhancing 
learning-related  CSR  skills.  Such  interventions  would  naturally 
want  to  target  CSR  skills  for  which  there  was  some  assurance  that 
gains  on  those  skills  were  associated  with  learning  gains. 

The  CSR  measures  most  on  target  for  use  in  pre-K  settings  when 
their  implications  for  learning  and  achievement  are  of  primary 
interest  should  be  those  that  show  the  strongest  relations  of  these 
different  kinds.  That  is,  we  would  expect  children  with  better 
learning-related  CSR  skills  not  only  to  have  higher  achievement, 
but  to  show  greater  achievement  gains  over  time,  and  if  those  CSR 
skills improve, to show correspondingly larger gains in achievement.
The more relevant measures of these learning-related CSR
skills,  therefore,  are  those  that  best  demonstrate  these  relations. 
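
As a rough illustration of the three kinds of relations just described, the sketch below computes simple Pearson correlations from pretest and posttest scores. It is a minimal sketch under stated assumptions, not the analysis reported in this article: the function names and the use of raw difference scores as "gains" are hypothetical choices made for illustration, and the analyses reported later may adjust for covariates or use other models.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def csr_achievement_relations(csr_pre, csr_post, ach_pre, ach_post):
    """Three illustrative relations between one CSR measure and achievement.

    Assumption (not from the article): gains are raw post-minus-pre differences.
    """
    csr_gain = [post - pre for pre, post in zip(csr_pre, csr_post)]
    ach_gain = [post - pre for pre, post in zip(ach_pre, ach_post)]
    return {
        # 1. Initial CSR skill and later achievement
        "pre_csr_vs_later_achievement": pearson_r(csr_pre, ach_post),
        # 2. Initial CSR skill and achievement gain
        "pre_csr_vs_achievement_gain": pearson_r(csr_pre, ach_gain),
        # 3. CSR gain and achievement gain
        "csr_gain_vs_achievement_gain": pearson_r(csr_gain, ach_gain),
    }
```

A measure that performs well on these criteria would show relatively large values on all three entries compared with the other candidate measures.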

We  then  brought  two  additional  perspectives  to  the  assessment 
of  the  educational  relevance  of  the  candidate  CSR  measures.  For 
one  of  these,  we  considered  the  extent  to  which  the  measures  were 
responsive  to  developmental  change,  that  is,  showed  nontrivial 
increases  as  CSR  skills  improved  through  maturation  and  whatever 
facilitation  occurred  in  school  classrooms.  CSR  measures  that 
show  no  or  limited  increases  during  pre-K  and  subsequent  early 
grades  are  thus  relatively  insensitive  to  the  gains  young  children 
are  known  to  make  during  those  periods.  Measures  that  are  more 
sensitive  to  change  will,  by  their  very  nature,  perform  better  for 
assessing  change  and  distinguishing  children  whose  CSR  skills 
differ. 

Finally, we considered the relation between the candidate measures
and teacher ratings of the CSR-related learning skills teachers are
able to observe in the classroom, including persistence, independence,
organization, and participation. Teacher ratings of such
learning  skills  have  been  found  to  be  predictive  of  later  academic 
achievement  (Bodovski  &  Farkas,  2007;  Davoudzadeh  et  al.,  2015; 
Schmitt et al., 2014) and reflect how CSR skills are manifest in
children’s classroom behavior. However, these ratings show distinct
differences from the results of direct assessments of children’s
CSR skills (Fuhs, Farran, & Nesbitt, 2015; Matthews et al.,
2009; Schmitt et al., 2014), and thus cannot be assumed to be
equivalent measures of the underlying CSR skills of interest.
Nonetheless, the candidate CSR measures with the greatest educational
relevance in pre-K settings should also show close relations
to teacher ratings of the learning skills those teachers observe
in  the  classroom.  Such  relations  help  establish  the  ecological 
validity  of  the  measures  for  use  in  pre-K  contexts  as  well  as  giving 
them  credibility  with  teachers  who  may  use  them. 

To  conduct  a  comparative  assessment  of  the  performance  of 
direct  assessments  of  CSR  skills  on  these  attributes,  we  selected  a 
range of candidate measures as described in more detail below and
administered them to a large sample of children at the beginning
and  end  of  the  pre-K  year  and  again  at  the  end  of  kindergarten.  We 
then  used  those  data  to  assess  each  measure  for  its  ability  to  predict 
achievement  and  achievement  gain,  responsiveness  to  change  over 
time,  and  correlation  with  teacher  ratings.  The  best  performing 
measures  identified  in  those  analyses  were  then  administered  to  a 
new  sample  of  children  before  and  after  the  pre-K  year  to  allow 
cross-validation  of  the  findings  from  the  initial  sample  and  support 
collection  of  test-retest  reliability  data.  The  procedural  details  and 
results  are  described  in  the  sections  that  follow. 


Method 

To  identify  candidate  measures,  we  first  reviewed  the  literature 
on executive function, effortful control, attention, and self-regulation
in an attempt to delineate the range of skills likely to be
relevant  to  learning-related  CSR.  The  skill  domains  distinguished 
for  this  purpose  were 

1. Sustained attention: attending to and sustaining focus on
a task.

2. Attention shifting: shifting focus within or between
tasks as situations demand.

3. Working memory: active maintenance and manipulation
of information in memory.

4. Inhibitory control: volitional inhibition of a prepotent
response in order to complete a task.

5. Effortful control: suppression of impulsive or premature
responses when required by a task.

We  then  reviewed  a  wide  range  of  CSR-related  measures  that 
have  appeared  in  research  with  pre-K  age  children  (a  list  of  those 
measures  is  in  Appendix  A).  We  categorized  each  according  to  the 
skill  domain  that  seemed  most  central  to  accomplishing  the  tasks 
the  measure  presented,  relying  heavily  on  the  description  of  the 
measure  in  the  associated  literature.  Of  course,  none  of  these  are 
pure  measures  of  the  skills  indicated  by  the  labels  we  applied  to  the 
respective  domains;  they  all  tap  into  multiple  overlapping  skills. 
But  sorting  them  this  way  and  selecting  at  least  one  measure  from 
each  category  ensured  that  we  would  end  up  with  a  diverse  set  that 
collectively  should  span  the  full  range  of  CSR  skill  domains 
identified  in  research  on  this  topic.  When  making  these  selections, 
we prioritized measures previously shown to be related to academic
achievement and those we judged to be most practical for
administration in classroom settings without the need for computer
support or specialized equipment. Through this process, we identified
10 candidate measures that yield 12 indices of CSR (two
measures  assess  both  accuracy  and  reaction  time  [RT]),  which  are 
described  below. 

Sustained  Attention 

For  assessment  tasks  requiring  the  capacity  to  maintain  focus 
and attention, we chose Copy Design (Davie, Butler, & Goldstein,
1972; Osborn, Butler, & Morris, 1984) and the Kansas
Reflection-Impulsivity Scale for Preschoolers (KRISP; Wright,
1971). For Copy Design, children copy eight geometric designs
of  increasing  difficulty  and,  for  each,  the  quality  of  the  best 
attempt  is  scored  0  or  1  by  defined  criteria  with  total  scores 
ranging  from  0  to  8.  Cronbach  alphas  for  this  measure  were  .79 
in  the  data  we  collected  on  our  sample  (described  below)  at  the 
beginning  of  the  pre-K  year  and  .75  in  the  data  collected  at  the 
end  of  the  year. 
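
For readers less familiar with the reliability coefficient reported here and for the measures below, the following sketch shows how Cronbach's alpha can be computed from a children-by-items matrix of item scores (e.g., the eight 0/1 Copy Design items). It is a minimal sketch: the function name is hypothetical and the software actually used for the reported values is not stated in the text.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of item scores per child).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across children
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of each child's total score
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

Population variances are used throughout; because alpha depends only on the ratio of variances, using sample variances consistently would give the same value.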

The  KRISP  presents  children  with  a  series  of  drawings  for 
which  they  must  identify  the  duplicate  of  a  target  picture  from  4-6 
other  pictures,  all  but  one  different  in  minor  ways.  Each  of  12  trials 
is  scored  for  number  of  errors  and  RT  to  selection  of  the  first 
drawing.  Accuracy  is  scored  as  the  number  of  errors  subtracted 
from  the  total  errors  possible  (36).  RT  is  scored  as  the  difference 
between  the  mean  for  the  5  hardest  and  7  easiest  trials  divided  by 
the  mean  for  the  hardest  ones,  thus  indexing  how  much  the  child 
slowed  down  to  reflect  on  the  harder  items.  Cronbach  alphas  for 
accuracy  were  .66  at  the  beginning  of  pre-K  and  .63  at  the  end. 
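
The two KRISP indices described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the caller is assumed to know which 5 trials count as hardest and which 7 as easiest (the text does not list them), and the direction of the difference (hard minus easy) is an assumption inferred from the description of the index as how much the child slowed down on harder items.

```python
def krisp_accuracy(errors_per_trial, max_errors=36):
    """Accuracy = total errors possible (36) minus the errors made over 12 trials."""
    total_errors = sum(errors_per_trial)
    if not 0 <= total_errors <= max_errors:
        raise ValueError("total errors must lie between 0 and max_errors")
    return max_errors - total_errors


def krisp_rt_index(hard_rts, easy_rts):
    """Relative slowing on hard trials.

    (mean RT on hardest trials - mean RT on easiest trials) / mean RT on hardest.
    The sign convention (hard minus easy) is an assumption of this sketch.
    """
    mean_hard = sum(hard_rts) / len(hard_rts)
    mean_easy = sum(easy_rts) / len(easy_rts)
    if mean_hard == 0:
        raise ValueError("mean RT on hard trials must be nonzero")
    return (mean_hard - mean_easy) / mean_hard
```

Under this convention, a child who takes 4 s on each hard trial and 2 s on each easy trial has an RT index of 0.5, and a child who responds equally fast on both has an index of 0.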

Attention  Shifting 

For  measures  requiring  the  ability  to  shift  focus  from  one  task  to 
another,  we  selected  the  Dimensional  Change  Card  Sort  (DCCS; 
Zelazo,  2006).  Children  sort  a  set  of  cards  according  to  one 
dimension  (color),  and  then  according  to  a  different  dimension 
(shape).  If  they  are  largely  successful  with  that  switch,  they  are 
given  similar  cards  with  a  black  border  around  some  and  asked  to 
sort  by  color  if  the  card  has  a  border  and  by  shape  if  not.  Children 
receive  a  score  of  0  if  they  do  not  pass  the  initial  color  sort,  1  if 
they  pass  the  color  but  not  the  shape  sort,  2  if  they  pass  the  shape 
sort,  and  3  if  they  also  pass  the  border  sort.  Cronbach  alphas  for 
color  sorting  were  .81  for  data  at  the  beginning  of  pre-K  and  .78  at 
the  end;  for  shape  sorting,  .96  at  the  beginning  of  pre-K  and  .92  at 
the  end.  Too  few  children  were  able  to  complete  the  border  task  to 
allow  alpha  values  to  be  computed. 
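
The DCCS scoring rule above amounts to a short decision ladder, sketched here. The pass criterion for each sort is not given in the text, so each phase is assumed to arrive as a pass/fail flag; the function name is hypothetical.

```python
def dccs_score(passed_color, passed_shape, passed_border=False):
    """Return the 0-3 DCCS score from pass/fail flags for each sort."""
    if not passed_color:
        return 0  # did not pass the initial color sort
    if not passed_shape:
        return 1  # passed color but not shape
    # Passed the shape sort; the border sort adds one more point.
    return 3 if passed_border else 2
```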

Working  Memory 

For  assessment  tasks  that  require  the  ability  to  temporarily  store 
and  manage  information,  we  selected  Operation  Span  (Blair  & 
Willoughby,  2006f)  and  Backwards  Digit  Span  (Davis  &  Pratt, 
1995).  For  Operation  Span,  children  are  shown  pictures  of  houses 
with  animals  and  colors  and  asked  to  name  them,  then  recall  the 
animal  in  each  house  on  a  second  display  of  empty  houses.  Six 
trials  with  two,  three,  or  four  items  to  remember  are  scored  0  for 
incorrect  and  1  for  a  correct  response,  with  the  sum  as  the  final 
score  (range  0  to  18).  Cronbach  alphas  were  .77  at  the  beginning 
of  pre-K  year  and  .64  at  the  end. 

Backwards Digit Span (Davis & Pratt, 1995) asks children to
remember, then reverse a series of numbers presented orally; for
example, given 1, 3, the child is to respond 3, 1. Across six trials
with increasing numbers of digits, each number recalled correctly
in backward sequence is scored 1 with the final score as the sum
of digits correctly recalled. In the pre-K year, too few children
were able to complete a sufficient number of items for Cronbach’s
alpha to be computed.

Inhibitory Control

For measures that require the ability to suppress a prepotent
response in order to complete a task, we selected
Head-Toes-Knees-Shoulders (HTKS; Ponitz et al., 2009), Peg Tapping
(Diamond & Taylor, 1996), and Spatial Conflict (Blair & Willoughby,
2006e). HTKS asks a child to respond to oral
prompts  of  “touch  your  head”  and  “touch  your  toes”  by  doing  the 
opposite  for  10  trials.  If  responses  on  five  or  more  are  correct,  two 
new  prompts  are  added  for  another  10  trials.  Each  trial  is  scored  0 
for  an  incorrect  response,  1  for  an  incorrect  motion  that  was 
corrected,  and  2  for  a  correct  response  with  the  sum  across  all 
items  as  the  final  score  (range  0  to  40).  Cronbach  alphas  for  the 
first  10  trials  were  .96  at  the  beginning  of  pre-K  and  the  same  at 
the  end;  for  the  second  10,  they  were  .85  at  the  beginning  and  .88 
at  the  end. 

The  Peg  Tapping  task  asks  children  to  tap  once  when  the 
examiner  taps  twice  and  twice  when  the  examiner  taps  once 
(Diamond  &  Taylor,  1996).  Children  largely  successful  in  practice 
trials then have 16 test trials scored 0 for incorrect and 1 for correct
responses. Final scores range from -1 to 16, with -1 assigned if
the  child  does  not  reach  criterion  in  the  practice  trials.  Cronbach 
alphas  were  .87  in  data  at  the  beginning  of  pre-K  year  and  .88  at 
the  end. 
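
As a minimal sketch of the Peg Tapping scoring rule just described (the function name and argument layout are assumptions for illustration):

```python
def peg_tapping_score(reached_criterion, trial_scores=()):
    """Return -1 if the practice criterion was not met, else the sum of 16 trials."""
    if not reached_criterion:
        return -1  # child never advanced to the test trials
    if len(trial_scores) != 16:
        raise ValueError("expected 16 test trials scored 0 or 1")
    return sum(trial_scores)
```

The -1 code keeps children who could not reach criterion in practice distinct from children who attempted the test trials and answered none correctly.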

The  Spatial  Conflict  task  (Blair  &  Willoughby,  2006e)  was  a 
paper  adaptation  of  the  computer-based  version  (Gerardi-Caulton, 
2000).  Children  are  given  a  card  with  one  button  on  the  right-hand 
side  and  one  on  the  left,  and  shown  a  series  of  arrows  that  point 
either  left  or  right.  They  are  asked  to  touch  the  button  on  the  side 
the  arrow  points  to  using  their  right  hand  for  the  button  on  the  right 
and their left hand for the one on the left. A series of congruent
trials (arrow on the same side of the page it points to) is followed
by 16 mixed congruent and incongruent trials scored 0 for the
incorrect  button,  1  for  the  correct  button  with  the  wrong  hand,  and 
2  for  the  correct  button  with  the  correct  hand,  with  the  total  score 
ranging  from  0  to  32.  Cronbach  alphas  were  .82  for  data  from  the 
beginning  of  pre-K  and  .77  at  the  end. 

Effortful  Control 

For  assessment  tasks  that  require  the  ability  to  suppress  impulsive 
or  premature  responses,  we  selected  the  Whisper  and  Turtle-Rabbit 
tasks  (Kochanska,  Murray,  Jacques,  Koenig,  &  Vandegeest,  1996).  In 
the  Whisper  task  children  are  shown  pictures  of  12  cartoon  characters 
and  asked  to  whisper  their  names.  The  cartoon  characters  vary  in 
familiarity,  providing  the  opportunity  for  the  child  to  act  impulsively 
(shout)  when  a  very  recognizable  one  comes  up.  Each  trial  is  scored 
0  for  a  shout,  1  for  a  normal  voice,  2  for  no  response,  and  3  for  a 
whisper  (range  0  to  36).  Cronbach  alphas  were  .96  at  the  beginning  of 
pre-K  and  .95  at  the  end. 

The Turtle-Rabbit task (Kochanska et al., 1996) presents children
with a drawing of a curved path with five bends and they are
asked  to  move  toy  figures  along  the  path  without  straying.  After 
baseline  trials  with  neutral  figures,  they  are  given  two  trials  with  a 
rabbit  they  are  told  is  fast,  and  two  with  a  turtle  they  are  told  is 
slow.  Each  curve  is  scored  0  if  bypassed,  1  if  the  figure  is  above 
the  mat  but  follows  the  general  curvature,  and  2  if  the  figure  stays 
on  the  mat  and  within  the  path.  Time  to  complete  each  trial  is  also 
recorded.  Accuracy  is  scored  as  the  total  for  all  curves  and  trials 
(range  0  to  60).  Reaction  time  is  scored  as  the  difference  between 
the  mean  times  for  the  turtle  and  rabbit  trials.  Cronbach  alphas  for 
accuracy  were  .99  for  both  rabbit  and  turtle  at  the  beginning  of 
pre-K, and .92 for rabbit and .89 for turtle at the end of pre-K.

Teacher Ratings of Cognitive Self-Regulation

Teacher  rating  scales  for  children’s  behaviors  in  the  classroom 
were  selected  to  mirror  as  much  as  possible  the  aspects  of  CSR 
identified in our initial literature review and assessed in the
candidate direct child measures. The following subscales were
combined in a single rating form.

Persistence. The Persistence subscale of the Temperament Assessment
Battery for Children (TABC; Martin, 1988) assesses each
child’s ability to sustain attention. The eight items on this subscale are
rated on a 1 (hardly ever) to 7 (almost always) scale and include such
behaviors  as  “child  can  continue  at  the  same  activity  for  an  hour”  and 
“if  child’s  activity  is  interrupted,  he/she  tries  to  go  back  to  the 
activity.”  Cronbach  alphas  for  this  subscale  were  .75  at  the  beginning 
of  pre-K  and  .74  at  the  end. 

Distractibility. The Distractibility subscale of the TABC assesses
the ability to ignore distractions. The eight items on this
subscale  are  rated  as  described  above  and  cover  such  behaviors  as 
“Child  is  easily  drawn  away  from  his/her  work  by  noises  .  .  .  etc.” 
and  “If  other  children  are  talking  or  making  noise  while  teacher  is 
explaining  a  lesson,  this  child  remains  attentive  to  the  teacher.” 
Cronbach  alphas  were  .89  at  the  beginning  of  pre-K  and  .90  at  the 
end. 

Impulsivity.  This  was  assessed  with  the  Impulsivity  subscale 
of  the  Children’s  Behavioral  Questionnaire  (CBQ;  Rothbart, 
Ahadi, Hershey, & Fisher, 2001). CBQ items are rated from 1
(extremely untrue of student) to 7 (extremely true). The 13 items
cover such behavior as “sometimes interrupts others when they are
speaking” and “usually stops and thinks things over before deciding
to do something.” Cronbach alphas were .87 at the beginning
of  pre-K  and  .88  at  the  end. 

Attention  shifting.  The  CBQ  Attention  Shifting  subscale  was 
used  for  this  dimension.  Twelve  items  are  also  rated  as  described 
above  and  include  such  behaviors  as  “needs  to  complete  one 
activity  before  being  asked  to  start  on  another  one”  and  “can  easily 
shift  from  one  activity  to  another.”  Cronbach  alphas  were  .87  at  the 
beginning  of  pre-K  and  .89  at  the  end. 

Work-related skills. A scale that spanned a variety of children’s
CSR skills as observed in the classroom was also included
in the teacher rating form: the Work-Related Skills subscale of
the Cooper-Farran Behavior Rating Scale (CFBR; Cooper & Farran,
1988). The 16 items on this scale ask about children’s independent
work, compliance with instructions, memory for instructions,
and completion of games and activities. Items are rated from
1  to  7  using  behavioral  anchors  distinctive  to  each  item.  Cronbach 
alphas  were  .95  at  the  beginning  and  end  of  pre-K. 

Academic  Achievement  Measures 

Achievement was measured with five subscales from the
Woodcock-Johnson III achievement battery (Woodcock, McGrew, &
Mather,  2001)  widely  used  in  early  childhood  education  research. 
These  included  two  math  subtests:  Applied  Problems  (numerical 
and spatial problems) and Quantitative Concepts (numbers,
sequencing, shapes, and symbols). Language and literacy skills were
assessed  with  Letter-Word  Identification  (identify  and  pronounce 
letters  and  read  words),  Picture  Vocabulary  (name  objects  in 
pictures  and  point  to  the  picture  that  goes  with  a  word),  and  Oral 
Comprehension (complete an orally presented passage by providing
the appropriate missing word). Data analysis used the IRT-scaled
W-scores, but standard scores (mean of 100, standard
deviation of 15) are more descriptive and showed fall pre-K
baseline mean values for the pre-K sample of 98 on Applied
Problems, 90 on Quantitative Concepts, 104 on Letter-Word
Identification, 100 on Picture Vocabulary, and 97 on Oral
Comprehension.

Participants  and  Assessment  Procedure 

Parental  consent  was  obtained  for  608  children  recruited  from 
58  pre-K  classrooms  in  32  schools/centers  across  four  school 
systems and five community childcare centers in middle Tennessee.
The consent rate was 60% (range 13% to 100% across
classrooms).  Consented  children  identified  as  English  Language 
Learners  were  screened  for  English  proficiency  using  the  Pre-LAS 
(Duncan  &  DeAvila,  1985).  Thirty-six  children  did  not  pass  the 
Pre-LAS,  5  did  not  assent,  and  32  moved  before  the  study  ended, 
leaving  535  children  in  the  final  analytic  sample. 

Participating  schools/centers  were  in  urban,  suburban,  and  rural 
settings  and  provided  a  racially  and  economically  diverse  sample 
of  children.  Although  information  about  race  and  economic  status 
was  not  available  for  individual  children,  aggregate  data  for  the 
schools/centers  showed  proportions  of  African  American  children 
that  ranged  from  0%  to  87%  (M  =  16%),  Hispanic  children  from 
2%  to  34%  (M  =  11%),  and  non-Hispanic  White  children  from 
13%  to  95%  (M  =  71%).  Economic  diversity  was  indicated  by  a 
range of children qualifying for free or reduced price lunch
programs from 16% to 100% (M = 55%). The children in the analytic
sample  ranged  in  age  from  3.8  to  5.4  (M  =  4.6)  at  the  beginning 
of  pre-K  and  52%  were  male. 

Procedure.  Children  were  assessed  twice  during  the  pre-K 
year — near  the  beginning  (early  September  through  October)  and 
the  end  (mid-March  to  early  May),  referred  to  as  Time  1  and  Time 
2, respectively. They were assessed again at the end of kindergarten
(mid-March to early May; Time 3). Time 1 and 2 assessments
were  administered  in  three  sessions  of  20-30  min  with  nearly  all 
sessions  occurring  within  10  or  fewer  weeks.  Time  3  assessments 
were  administered  in  two  sessions  spanning  fewer  than  five  days 
on  average.  Each  child  was  assessed  individually  in  a  quiet  area 
away  from  the  classroom  with  a  varying  order  for  sessions  but  a 
fixed  order  for  the  measures  within  a  session.  In  pre-K,  the 
sessions  included  (a)  Operation  Span,  Whisper,  Peg  Tapping,  and 
WJ-III  Applied  Problems  and  Quantitative  Concepts;  (b)  DCCS, 
HTKS,  Digit  Span,  Copy  Design,  and  WJ-III  Picture  Vocabulary; 
and (c) Spatial Conflict, Turtle-Rabbit, KRISP, and WJ-III Letter-Word
Identification and Oral Comprehension. Based on the findings
from pre-K, a reduced set of measures was administered in the
two sessions at the end of kindergarten: (a) Peg Tapping, HTKS,
Copy Design, and WJ-III Applied Problems, Quantitative Concepts,
and Picture Vocabulary; and (b) DCCS, KRISP, Digit Span,
and WJ-III Letter-Word Identification and Oral Comprehension.

Teacher  ratings  were  made  at  approximately  the  same  times  as 
the  child  assessments  near  the  beginning  and  end  of  the  pre-K  year. 
Kindergarten  teachers  then  completed  the  same  rating  scales  near 
the  end  of  the  kindergarten  year. 

Missing  data.  Of  the  535  children  who  comprised  the  initial 
pre-K  analytic  sample,  47  could  not  be  located  for  the  Time  3  end 
of  kindergarten  assessments,  leaving  488  children  in  the  follow-up 
sample. The children missing Time 3 data were compared with
those providing data on the available demographic variables and
the T1 and T2 CSR and achievement measures. The t tests with
Benjamini-Hochberg  corrections  for  the  large  number  of  multiple 
comparisons  showed  no  significant  differences  between  children 
assessed  and  not  assessed  in  kindergarten.  Given  no  indications 
that  the  missing  cases  made  the  follow-up  sample  unrepresentative 
of  the  initial  sample,  analyses  with  pre-K  data  were  conducted  on 
the  analytic  sample  of  535  children  while  those  with  kindergarten 
data  were  conducted  on  the  sample  of  488. 
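
The Benjamini-Hochberg correction mentioned above is a step-up procedure that controls the false discovery rate across many comparisons. The sketch below shows the standard procedure in plain Python. The function name is hypothetical, and the false discovery rate of .05 is an assumed default, since the text does not report the level that was used.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which tests are rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p value satisfies p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that largest rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

A result in which no entry is True corresponds to the pattern reported here: no significant differences between children assessed and not assessed in kindergarten.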

Cross-validation  sample  and  assessment  procedure.  The 
cross-validation  sample  was  drawn  from  a  later  cohort  of  children 
enrolled  in  pre-K  in  the  four  school  systems  that  provided  most  of 
the  original  sample.  These  children  were  assessed  three  times 
during the pre-K year — near the beginning (Time 1), approximately
2 weeks later (retest) to assess the test-retest reliability of
the  measures,  and  near  the  end  of  the  school  year  (Time  2). 
Parental  consent  was  obtained  for  593  children  from  43  classrooms 
in  23  schools  (overall  consent  rate  of  69%).  To  accommodate 
limited  resources  for  individual  testing,  only  10  consented  children 
were  randomly  selected  from  classrooms  with  more  than  10.  This 
procedure  produced  a  sample  of  416  children,  but  21  did  not  pass 
the  Pre-LAS  screen  for  English  proficiency,  four  did  not  assent  to 
the  assessments,  18  moved  prior  to  the  reliability  retest,  and  4  were 
withdrawn  due  to  assessor  error.  This  left  369  children  in  the 
sample  for  the  test-retest  reliability  data  collected  in  the  fall  of  the 
pre-K  year.  After  that,  13  children  moved  before  the  end  of  pre-K, 
leaving  356  in  the  sample  with  data  from  both  the  beginning  and 
end  of  the  pre-K  year. 
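
The sample sizes reported for both cohorts follow from simple subtraction, which the short tally below reproduces. All counts are taken from the text; the helper name and the dictionary labels are illustrative only.

```python
def remaining(start, exclusions):
    """Subtract each exclusion count from the starting number of children."""
    return start - sum(exclusions.values())


# Initial sample: consented children minus exclusions.
initial_sample = remaining(608, {"failed Pre-LAS": 36, "did not assent": 5, "moved": 32})
# Kindergarten follow-up of the initial sample.
followup_sample = remaining(initial_sample, {"not located at Time 3": 47})
# Cross-validation sample after random selection within classrooms.
retest_sample = remaining(
    416,
    {"failed Pre-LAS": 21, "did not assent": 4, "moved before retest": 18, "assessor error": 4},
)
final_cv_sample = remaining(retest_sample, {"moved before end of pre-K": 13})

print(initial_sample, followup_sample, retest_sample, final_cv_sample)
```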

The  mean  age  of  the  children  in  both  the  test-retest  and  final 
samples  was  4.4  years  and  53%  were  male.  As  in  the  initial 
sample,  the  schools  from  which  these  children  were  drawn  were 
economically  and  racially  diverse:  the  proportion  of  students  at 
each  school  qualifying  for  free  or  reduced  price  lunch  ranged  from 
26%  to  95%  (M  =  52%);  the  proportion  who  were  African 
American  ranged  from  0%  to  49%  (M  =  12%),  the  proportion 
Hispanic  ranged  from  1%  to  38%  (M  =  9%),  and  the  proportion 
non-Hispanic  white  ranged  from  33%  to  97%  (M  =  75%). 

At  Times  1  and  2,  there  were  two  assessment  sessions,  one  for 
CSR  and  one  for  achievement.  The  order  of  these  sessions  varied, 
but  the  measures  were  administered  in  fixed  order  at  each  session. 
Only  CSR  measures  were  administered  at  Retest.  In  addition,  at 
Time 1, Retest, and Time 2, teachers completed ratings on selected
CSR  measures  (described  later).  The  majority  (74%)  of  these 
teachers  had  also  participated  in  the  initial  phase  of  this  study. 

Results 

Analysis  of  the  data  described  above  was  organized  to  compare 
the  12  candidate  CSR  measures  with  regard  to  their  performance  in 
the  three  areas  described  earlier  that  we  judged  to  be  especially 
pertinent  to  applications  in  pre-K  settings  where  relevance  to 
academic  achievement  is  a  major  concern:  (a)  their  predictive 
ability for academic achievement, (b) responsiveness to developmental
change, and (c) concurrence with teacher ratings.

Predictive  Ability  for  Academic  Achievement 

The most important consideration for our purposes in assessing
the CSR measures was their ability to predict academic achievement,
measured here with the WJ-III Quantitative Concepts, Applied
Problems, Oral Comprehension, Picture Vocabulary, and
Letter-Word Identification subtests. The intercorrelations among
these  five  subtests  at  Times  1,  2,  and  3  were  positive  and  relatively 
high,  and  principal  components  analyses  showed  strong  one-factor 
solutions  with  loadings  from  .61  to  .84.  To  represent  overall 
academic  achievement,  therefore,  we  created  a  composite  score  for 
each  time  of  measurement  by  combining  the  W-scores  across  the 
five  subscales  for  each  child  with  each  subtest  given  equal  weight. 

CSR  predicting  achievement.  The  most  direct  answer  to  the 
question  of  the  relative  strength  of  the  relation  between  each  of  the 
selected  CSR  measures  and  later  achievement  is  obtained  by 
comparing  their  correlations  at  each  time  of  measurement  with 
achievement  measured  at  a  later  time.  To  address  this  question  we 
first  standardized  the  WJ  composite  achievement  measure  and 
each  of  the  CSR  measures  separately  for  each  time  of  testing  so 
that  the  magnitude  of  the  respective  relations  could  be  easily 
compared.  We  then  constructed  multilevel  regression  models  in 
which  each  CSR  measure  in  turn  was  used  as  the  sole  predictor  of 
achievement  at  a  later  time.  Multilevel  analysis  was  necessary  to 
respect  the  structure  of  the  data  and  ensure  that  standard  errors 
were  properly  estimated;  it  was  conducted  with  SPSS  23  Mixed 
Models  with  children  nested  in  classrooms,  classrooms  in  schools 
(three  levels),  and  both  classrooms  and  schools  treated  as  random 
effects.  All  the  time  intervals  available  in  our  data  were  examined: 
predicting  from  Time  1  (beginning  of  pre-K)  to  Time  2  (end  of 
pre-K)  and  Time  3  (end  of  kindergarten),  and  predicting  from 
Time  2  to  Time  3. 

Table 1 reports the standardized regression coefficients estimated
in each of these analyses. Because all the variables were
standardized and there was only one predictor in each analysis,
these coefficients can be read as zero-order product-moment
correlation coefficients. All these correlations were statistically
significant with the largest found for Backwards Digit Span, Copy
Design,  DCCS,  HTKS,  KRISP  Accuracy,  and  Peg  Tapping.  From 
the  beginning  of  pre-K  to  the  end  of  pre-K  (Time  2)  and  then  to  the 
end  of  kindergarten  (Time  3),  the  correlations  for  those  CSR 
measures  ranged  from  .37  to  .56.  From  the  end  of  pre-K  (Time  2) 
to  the  end  of  kindergarten,  they  ranged  from  .38  to  .55. 
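
The statement that a standardized coefficient from a single-predictor model can be read as a correlation can be checked with a small single-level example. The sketch below uses ordinary least squares on made-up scores, not the three-level mixed models reported here, so it illustrates only the underlying identity. All names and data values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(xs):
    """Standardize a list of scores to mean 0 and standard deviation 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Zero-order product-moment correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


csr = [1, 2, 3, 4, 5]          # made-up CSR scores
achievement = [2, 1, 4, 3, 5]  # made-up later achievement scores
beta = ols_slope(zscores(csr), zscores(achievement))
r = pearson_r(csr, achievement)
```

With both variables standardized, the slope and the correlation coincide. In a multilevel model the match is approximate because of the random effects for classrooms and schools.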

CSR  predicting  achievement  gain.  The  analyses  reported 
above  show  that  children  with  better  initial  skills  on  the  CSR 
measures  show  higher  achievement  levels  at  a  later  time,  but 
those  children  also  have  higher  achievement  to  begin  with — the 
concurrent  CSR-achievement  correlations  at  Time  1  and  Time  2 
for  the  best  CSR  measures  in  Table  1  ranged  from  .42  to  .59. 
The next set of analyses therefore addressed the further question
of the relative ability of the CSR measures to predict the
achievement  gains  made  over  a  subsequent  period.  Our  interest 
is  in  achievement  gains  associated  with  the  experiences  children 
have  over  a  school  year,  not  the  portion  predictable  from  their 
initial  achievement  levels  prior  to  those  experiences.  For  these 
analyses,  we  used  the  same  3-level  regression  models  described 
above,  but  with  the  Time  1  WJ  composite  variable  included  as 
a  covariate  in  each  analysis  along  with  the  respective  CSR 
measure.  The  CSR  measures  in  these  analyses,  therefore,  were 
predicting residual gain in achievement; that is, later achievement
with initial achievement held constant (Cronbach &
Furby, 1970).

Table 1

Standardized Regression Coefficients Between Each of the Cognitive Self-Regulation (CSR) Measures and Later Academic
Achievement for the Initial and Cross-Validation Samples

                              Initial sample (n = 535)                                          Cross-validation sample (n = 356)
                              Time 1 CSR &          Time 1 CSR &          Time 2 CSR &          Time 1 CSR &
CSR measure                   Time 2 Achievement    Time 3 Achievement    Time 3 Achievement    Time 2 Achievement
Backwards Digit Span          .42                   .37                   .47                   .46
Copy Design                   .41                   .40                   .38                   .40
DCCS                          .45                   .44                   .42                   .50
HTKS                          .50                   .49                   .55                   .52
KRISP Accuracy                .48                   .50                   .43                   .46
KRISP Reaction Time           .25                   .23                   .21                   —a
Operation Span                .26                   .27                   .21                   —
Peg Tapping                   .56                   .51                   .52                   .58
Spatial Conflict              .29                   .27                   .18                   —
Turtle-Rabbit Accuracy        .22                   .23                   .18                   —
Turtle-Rabbit Reaction Time   .32                   .31                   .26                   —
Whisper Task                  .37                   .36                   .25                   —
CSR factor score                                                                                .79

Note. N = 488 at Time 3. All correlations are statistically significant at p < .01 in multilevel analysis. DCCS = Dimensional Change Card Sort; HTKS =
Head Toes Knees Shoulders; KRISP = Kansas Reflection-Impulsivity Scale for Preschoolers. Academic achievement is the composite measure combining
five Woodcock-Johnson subscales. Time 1 = beginning of pre-K; Time 2 = end of pre-K; Time 3 = end of kindergarten. The CSR factor score is based
on the six individual CSR measures shown for the cross-validation sample.
a Measure not included in cross-validation.


The first three columns of Table 2 show standardized regression
coefficients from these analyses. It is not surprising that they are
relatively small given the strong relation between initial and later
achievement. Nonetheless, many of the CSR measures had statistically
significant predictive relations with achievement gain from the
beginning to the end of pre-K and to the end of kindergarten, as well
as from the end of pre-K to the end of kindergarten. The measures
with significant positive predictive relations for all three intervals, at
least at p < .10, were Backwards Digit Span, Copy Design, HTKS,
KRISP Accuracy, and Peg Tapping.


Table 2

Standardized Regression Coefficients for the Relation Between Each Cognitive Self-Regulation (CSR) Measure and Residual Gain on
the Academic Achievement Composite for the Initial and Cross-Validation Samples

                              Initial sample (n = 535)                                                                       Cross-validation sample (n = 356)
                              Time 1 CSR &       Time 1 CSR &       Time 2 CSR &       T1-T2 CSR Gain &   T1-T2 CSR Gain &   Time 1 CSR &       T1-T2 CSR Gain &
CSR measure                   T1-T2 Ach Gain     T1-T3 Ach Gain     T2-T3 Ach Gain     T1-T2 Ach Gain     T1-T3 Ach Gain     T1-T2 Ach Gain     T1-T2 Ach Gain
Backwards Digit Span          .06*               .05†               .08*               .12*               .14*               .05                .06†
Copy Design                   .12*               .12*               .05*               .07*               .06*               .05                .10*
DCCS                          .07*               .10*               .04                .10*               .06*               .11*               .10*
HTKS                          .10*               .08*               .13*               .09*               .14*               .06†               .08*
KRISP Accuracy                .09*               .17*               .10*               .09*               .08*               .09*               .09*
KRISP Reaction Time           .09*               .09*               .02                .05*               .05†               —a                 —
Operation Span                .07*               .09*               .01                .05*               .02                —                  —
Peg Tapping                   .09*               .09*               .05†               .11*               .07*               .10*               .15*
Spatial Conflict              .06*               .06*               .03                .05*               .05†               —                  —
Turtle-Rabbit Accuracy        .03                .05†               -.02               .08*               .03                —                  —
Turtle-Rabbit Reaction Time   .03                .05†               .07*               .01                .04                —                  —
Whisper Task                  .06*               .07*               -.05*              .09*               -.01               —                  —
CSR factor score                                                                                                             .18*               .16*

Note. N = 488 at T3. Ach = Achievement; DCCS = Dimensional Change Card Sort; HTKS = Head Toes Knees Shoulders; KRISP = Kansas
Reflection-Impulsivity Scale for Preschoolers; RT = Reaction Time. Academic achievement is the composite measure combining five Woodcock-Johnson
subscales. Time 1 = beginning of pre-K; Time 2 = end of pre-K; Time 3 = end of kindergarten. The CSR factor score is based on the 6 individual CSR
measures shown for the cross-validation sample.
a Measure not included in cross-validation.

CSR gain predicting achievement gain. The last set of analyses
addressing CSR-achievement relations compared the CSR measures
with regard to the extent to which the CSR skill gains they
showed  during  the  pre-K  year,  and  between  the  end  of  pre-K  and  end 
of  kindergarten,  were  correlated  with  achievement  gains  made  over 
those  same  periods.  For  this  gain-with-gain  analysis,  we  first  used  the 
same three-level regression models described earlier to estimate
residual gain for each CSR measure over the respective periods by
predicting  later  CSR  scores  from  the  initial  (Time  1)  values  on  the 
same  CSR  measure.  The  residuals  from  those  analyses,  representing 
the  changes  in  the  CSR  measure  that  cannot  be  predicted  from  their 
initial  status,  are  residual  gain  scores  for  the  CSR  measures.  Those 
residual  gain  scores  were  then  used  as  independent  variables  in  a 
second  series  of  multilevel  regression  analyses  in  which  each  CSR 
residual  gain  score  was  used  to  predict  later  achievement  with  initial 
achievement  controlled,  the  analysis  model  used  above  to  examine 
residual  gain  on  achievement. 
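
The two-step gain-with-gain procedure described above can be sketched with single-level least squares. This is a simplified illustration under stated assumptions: it ignores the nesting of children in classrooms and schools that the reported three-level models account for, it leaves the variables unstandardized, and every function and variable name is hypothetical.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - slope * mx, slope


def residuals(x, y):
    """Part of y that is not predicted by x."""
    a, b = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def gain_with_gain(csr_t1, csr_t2, ach_t1, ach_t2):
    """Coefficient for CSR residual gain predicting later achievement,
    with initial achievement controlled."""
    # Step 1: CSR residual gain = later CSR not predicted by initial CSR.
    csr_gain = residuals(csr_t1, csr_t2)
    # Step 2: partial initial achievement out of both later achievement and
    # the CSR gain score, then regress one residual on the other. This gives
    # the same coefficient as entering both predictors in one regression.
    ach_resid = residuals(ach_t1, ach_t2)
    gain_resid = residuals(ach_t1, csr_gain)
    return simple_ols(gain_resid, ach_resid)[1]
```

The returned coefficient answers the question posed in the text: among children with the same initial achievement, do those whose CSR scores rose more than expected also show higher later achievement?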

The  fourth  and  fifth  columns  of  Table  2  report  the  standardized 
regression  coefficients  from  these  analyses.  As  in  the  previous 
analysis,  these  coefficients  are  relatively  small  because  the  much 
larger  relations  between  prepost  CSR  and  prepost  achievement 
have  been  adjusted  out  of  the  results.  The  relations  of  CSR  residual 
gain  during  pre-K  with  residual  achievement  gain  during  that  same 
year,  and  with  residual  achievement  gain  between  the  beginning  of 
pre-K  and  the  end  of  kindergarten,  are  nonetheless  statistically 
significant  for  many  of  the  CSR  measures.  The  better  performing 
CSR  measures  across  these  various  intervals,  as  indicated  by  the 
pattern  of  statistical  significance,  were  Backwards  Digit  Span, 
Copy  Design,  DCCS,  HTKS,  KRISP  Accuracy,  and  Peg  Tapping. 

Responsiveness  to  Developmental  Change 

The last set of analyses reported above demonstrated that residual
gain on some of the CSR measures was significantly related to
residual gain on the achievement measures. However, those analyses
do not directly address the question of how much change
there is on each CSR measure during the pre-K year. As noted
earlier, the most educationally relevant CSR measures are those
capable of showing the most growth during the pre-K year. To
examine the responsiveness of the measures to developmental
change, children’s scores on each CSR measure at the beginning
of pre-K were compared to their scores at the end of the year.
These  analyses  were  conducted  with  four-level  regression  models 
in  which  a  dummy  code  for  time  predicted  each  CSR  score  with 
Time  1  and  Time  2  scores  nested  within  children  and  children 
nested  within  classrooms  and  schools.  The  CSR  scores  were  not 
standardized  for  this  analysis,  allowing  estimation  of  the  mean 
scores  at  Time  1  (time  =  0)  and  Time  2  (time  =  1)  in  the  original 
metric.  Table  3  shows  the  means  and  the  standard  deviations  for 
each CSR measure. The difference between children’s performance
at Time 1 and Time 2, indexed by the regression coefficient
on the time dummy code, was statistically significant for all the
CSR measures except Turtle-Rabbit Accuracy. Pre-post standardized
mean difference effect sizes are also shown in Table 3, computed
as the Time 2 mean minus the Time 1 mean divided by the pooled
standard deviation. These effect sizes for all the measures other than
Turtle-Rabbit Accuracy were positive and ranged from .30 to .69, with
the greatest gains for Copy Design, DCCS, HTKS, KRISP Accuracy,
and Peg Tapping (effect sizes greater than .50).
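
The effect size computation described above can be reproduced from the means and standard deviations in Table 3. The sketch below assumes equal sample sizes at the two time points, so that the pooled standard deviation is the square root of the average of the two variances. That assumption and the function name are not stated in the text.

```python
from math import sqrt


def cohens_d(mean_t1, sd_t1, mean_t2, sd_t2):
    """Time 2 mean minus Time 1 mean, divided by the pooled standard deviation."""
    pooled_sd = sqrt((sd_t1 ** 2 + sd_t2 ** 2) / 2)  # assumes equal n at both times
    return (mean_t2 - mean_t1) / pooled_sd


# Values taken from Table 3.
d_copy_design = cohens_d(1.40, 1.43, 2.27, 1.70)
d_krisp_accuracy = cohens_d(28.94, 4.09, 31.44, 3.13)
d_turtle_rabbit_accuracy = cohens_d(54.18, 9.96, 54.22, 6.62)
```

Rounded to two decimals, these give .55 for Copy Design, .69 for KRISP Accuracy, and .00 for Turtle-Rabbit Accuracy, matching the tabled effect sizes.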

Table  3  also  shows  the  zero-order  product-moment  correlations 
between  children’s  scores  at  the  beginning  and  end  of  pre-K.  These 
were  all  statistically  significant  and  ranged  from  .12  to  .66.  The  largest 
of  them  showed  reasonable  consistency  in  children’s  relative  ranking 
over  the  school  year.  Nevertheless,  they  were  not  so  large  as  to 
indicate  that  only  stable  individual  differences  are  reflected  in  these 
CSR measures with no room for influence from differential experiences
in and out of the classroom during this period.

Table 3

Change in Scores on the Cognitive Self-Regulation (CSR) Measures from the Beginning (Time 1)
to End of Pre-K (Time 2)

CSR measure                   Time 1: M (SD)    Time 2: M (SD)    T1-T2 effect size   T1-T2 correlation
Backwards Digit Span          1.31 (1.20)       2.05 (2.13)       .43                 .46
Copy Design                   1.40 (1.43)       2.27 (1.70)       .55                 .59
DCCS                          1.47 (.57)        1.75 (.52)        .51                 .38
HTKS                          8.91 (11.89)      15.51 (14.11)     .51                 .61
KRISP Accuracy                28.94 (4.09)      31.44 (3.13)      .69                 .56
KRISP Reaction Time           .15 (.34)         .30 (.26)         .50                 .12
Operation Span                8.57 (3.87)       9.67 (3.18)       .31                 .38
Peg Tapping                   6.99 (6.01)       10.21 (5.48)      .56                 .62
Spatial Conflict              20.86 (6.82)      22.82 (6.06)      .30                 .31
Turtle-Rabbit Accuracy        54.18 (9.96)      54.22 (6.62)      .00                 .20
Turtle-Rabbit Reaction Time   5.84 (8.33)       10.69 (15.43)     .39                 .54
Whisper Task                  30.04 (8.13)      32.82 (6.04)      .39                 .35

Note. N = 535. The pre-post difference is statistically significant at p < .001 for all measures except
Turtle-Rabbit Accuracy. Effect size is Cohen’s d for the difference between the means at Time 1 and Time 2.
DCCS = Dimensional Change Card Sort; HTKS = Head Toes Knees Shoulders; KRISP = Kansas Reflection-Impulsivity
Scale for Preschoolers.

Concurrence With Teacher Ratings

To investigate the relation between the CSR measures and
teachers’ ratings of CSR-related behavior in the classroom, we
examined the correlations between each CSR measure and each
of the five teacher rating scales (CFBR Work-Related Skills,
TABC Distractibility, TABC Persistence, CBQ Attention Shifting,
and CBQ Impulsivity) and a composite scale created by
summing z-scores computed for each teacher rating scale.
These  correlations  were  estimated  as  standardized  regression 
coefficients  in  3-level  regression  models  in  which  the  respec¬ 
tive  CSR  measure  at  either  Time  1  or  Time  2  was  the  sole 
predictor  of  a  teacher  rating  obtained  at  the  corresponding  time. 
The  correlations  of  each  CSR  measure  with  the  composite  scale 
and  with  each  individual  teacher  rating  scale  are  reported  in 
Table  4  for  the  beginning  and  end  of  pre-K. 

As Table 4 shows, all these correlations were statistically significant
except for a few involving CBQ Impulsivity. The largest
correlations with the Teacher Rating Composite appeared for Peg
Tapping, HTKS, and KRISP Accuracy (.34 to .42). Close behind
were Copy Design, DCCS, and Turtle-Rabbit Accuracy with correlations
of at least .25. The correlations were substantially similar
for ratings at the beginning and end of pre-K. The correlations with
individual teacher rating scales showed similar patterns, though
lower for the CBQ scales.

Summary  of  Findings  on  the  Selected  Criteria 

Table 5 summarizes the comparative findings reported above
for the performance of the candidate CSR measures by identifying
the top performers in each analysis based on the magnitude
of the parameter estimates and/or statistical significance.
The measures are listed with the better performing ones first
rather than in alphabetical order as in the previous tables. Four
CSR measures were among the top performers in every analysis:
Copy Design, HTKS, KRISP Accuracy, and Peg Tapping.
DCCS was very close behind, appearing in the top performing
group in all but one analysis. Consideration must also be given
to Backwards Digit Span, which showed good performance for
predicting achievement, though it was not among the top performers
in the other analyses. The most notable feature of this
summary is the consistency of the CSR measures that performed
well: those that were strong in one analysis were
strong in all or nearly all of them, and those weak in any one
analysis were weak in all or nearly all.

Performance of the Top CSR Measures in Combination

As the summary in Table 5 indicates, there were six CSR
measures that performed best in the comparative analyses. With
those results in hand, we then undertook an exploration of the
relations of those six measures to achievement when taken
altogether to determine which showed the strongest independent
relations relative to the others and to assess the potential
value of a composite of multiple measures. For that purpose,
another series of three-level regression analyses was conducted
with all six of these measures used together as predictors. To
examine their collective performance, multiple correlations
were estimated for their relations to the different dependent
variables of interest. This was done by first fitting the models
with the six CSR measures omitted to obtain an estimate of the
total unconditional variance (residual variance when the
achievement pretest was a necessary covariate) across all levels
on the respective dependent variables. We then ran the same
models with all six measures included as predictors and obtained
the total conditional variance from those analyses. The
difference between the total unconditional variance without the
six CSR measures and the total conditional variance with them
in the model represents the amount of the total between-student


Table 4

Concurrent Correlations Between Child Cognitive Self-Regulation (CSR) Measures and Teacher Rating Scales at the Beginning
(Time 1) and End of Pre-K (Time 2) for Initial and Cross Validation Samples

                        Initial sample (n = 535)                                                                        Cross-validation sample (n = 356)
                        CFBR work-      TABC            TABC            CBQ attention   CBQ             Teacher rating  Teacher rating
                        related skills  distractibility persistence     shifting        impulsivity     composite       total score
CSR measure             Time 1 Time 2   Time 1 Time 2   Time 1 Time 2   Time 1 Time 2   Time 1 Time 2   Time 1 Time 2   Time 1 Time 2
Backwards Digit Span    .24    .20      .22    .15      .22    .17      .14    .14      .04    .05      .21    .17      .30    .27
Copy Design             .34    .32      .33    .28      .31    .32      .18    .20      .11    .08      .31    .29      .42    .47
DCCS                    .27    .25      .29    .24      .27    .25      .21    .17      .13    .12      .28    .25      .36    .30
HTKS                    .36    .39      .35    .39      .28    .40      .28    .29      .10    .18      .34    .39      .39    .41
KRISP Accuracy          .38    .39      .35    .35      .32    .38      .28    .28      .12    .24      .36    .40      .41    .38
KRISP RT                .22    .20      .19    .13      .21    .16      .16    .13      .01    .03      .19    .16      a      a
Operation Span          .23    .19      .23    .14      .15    .16      .14    .13      .06    .08      .2(4   .17      a      a
Peg Tapping             .43    .39      .42    .36      .36    .36      .34    .28      .16    .19      .42    .38      .45    .41
Spatial Conflict        .18    .15      .24    .17      .19    .20      .17    .13      .16    .12      .23    .19      a      a
Turtle-Rabbit Accuracy  .24    .23      .27    .26      .20    .25      .23    .23      .13    .18      .26    .28      a      a
Turtle-Rabbit RT        .21    .17      .19    .13      .16    .13      .12    .08      .03    .06      .17    .14      a      a
Whisper Task            .27    .17      .27    .19      .19    .20      .22    .10      .03    .14      .24    .20      a      a

Note. Correlations greater than .09 are statistically significant at p < .05 in multilevel analysis. DCCS = Dimensional Change Card Sort; HTKS = Head
Toes Knees Shoulders; KRISP = Kansas Reflection-Impulsivity Scale for Preschoolers; RT = reaction time.
a Measure not included in cross-validation.

Table 5

Summary of the Performance of the Candidate Cognitive Self-Regulation (CSR) Measures on the Attributes Examined

                              Predicting achievement
                              T1 & T2 CSR &     T1 & T2 CSR &     Pre-K CSR gains &   Developmental   Concurrence with
CSR measure                   later             achievement       achievement         changeᵈ         teacher ratingsᵉ
                              achievementᵃ      gainsᵇ            gainsᶜ
Copy Design                   X                 X                 X                   X               X
HTKS                          X                 X                 X                   X               X
KRISP Accuracy                X                 X                 X                   X               X
Peg Tapping                   X                 X                 X                   X               X
DCCS                          (X in two of these three columns; placement unclear)    X               X
Backwards Digit Span          X                 X                 X
Turtle-Rabbit Accuracy                                                                                X
KRISP Reaction Time
Operation Span
Turtle-Rabbit Reaction Time
Spatial Conflict
Whisper Task

Note. The better performing CSR measures on each criterion are indicated by X. DCCS = Dimensional Change Card Sort; HTKS = Head Toes Knees
Shoulders; KRISP = Kansas Reflection-Impulsivity Scale for Preschoolers. Time 1 (T1) = beginning of pre-K; Time 2 (T2) = end of pre-K; Time 3
(T3) = end of kindergarten.
ᵃ Correlations for T1 predicting to T2 and T3 achievement, and T2 predicting to T3 achievement are ≥ .35 and significant at p < .05.
ᵇ Correlations for T1 predicting T1-T2 and T1-T3 achievement gain, and T2 predicting T2-T3 achievement gain are significant at p < .10 or better.
ᶜ Correlations for T1-T2 gain predicting T1-T2 and T1-T3 achievement gain are significant at p ≤ .05.
ᵈ Effect size for change from T1 to T2 is > .50.
ᵉ T1 and T2 correlations with the Teacher Rating Composite are > .25 and significant at p < .05.


variance accounted for by the CSR measures, essentially an
R-squared value when represented as a proportion. The square
root of that estimate was taken as the multiple correlation of
interest. In addition, the standardized regression coefficient for
each CSR measure indicated the independent contribution that
measure made to predicting the respective dependent variable.
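
The variance-difference procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout, and the variance components are hypothetical, chosen only to show the arithmetic. Clamping a negative difference to zero is also an assumption made here so that the square root is always defined.

```python
import math


def multiple_correlation(unconditional: dict, conditional: dict) -> float:
    """Multiple correlation from total variance before and after adding predictors.

    Each argument maps a level name to its estimated variance component.
    """
    total_without = sum(unconditional.values())  # model without the CSR measures
    total_with = sum(conditional.values())       # model with the six CSR measures
    if total_without <= 0:
        raise ValueError("total unconditional variance must be positive")
    # Proportion of total variance accounted for (an R-squared analogue)
    r_squared = max((total_without - total_with) / total_without, 0.0)
    return math.sqrt(r_squared)


# Hypothetical variance components for a three-level model
unconditional = {"school": 4.0, "classroom": 6.0, "student": 90.0}
conditional = {"school": 2.0, "classroom": 4.0, "student": 42.0}

# Total variance falls from 100 to 48, so R-squared = 0.52
r_multiple = multiple_correlation(unconditional, conditional)
```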

The results of these analyses are summarized in the upper
portion of Table 6. The first panel reports the collective relation
of the six CSR measures to composite achievement measured
later. The multiple correlations, ranging from .68 to .72, can be
compared with the analogous correlations for the individual
measures shown in Table 1, all of which are smaller. The
standardized regression coefficients indicate that the strongest
independent contributions were made by HTKS, KRISP Accuracy,
Peg Tapping, and Backwards Digit Span.

The second panel of Table 6 provides the results for the six
CSR measures collectively predicting achievement gain over
various periods. The multiple correlations, ranging from .23 to
.28, can be compared with the standardized regression coefficients
in the first four columns of Table 2, all of which are
notably smaller. The regression coefficients in Table 6 indicate
that KRISP Accuracy and HTKS have the strongest independent
relations to achievement gain, followed by Copy Design
and Backwards Digit Span.

The third panel in Table 6 reports the results for the most
important predictive relations with achievement, namely those
between residual gain on the CSR measures and residual achievement
gain. The multiple correlations, which can be compared
with the smaller standardized regression coefficients for the
individual measures in the last three columns of Table 2, ranged
from .32 to .39. The individual measures making the strongest
independent contributions were Backwards Digit Span and Peg
Tapping.

The last panel of Table 6 shows the multiple correlations that
index the concurrent relations of the set of six CSR measures with
the composite teacher ratings at the beginning (T1) and end (T2) of
pre-K. Those multiple correlations (.49 and .50, respectively) can
be compared with the analogous correlations for each individual
CSR measure reported in the first two columns of Table 4, all of
which are smaller. The standardized regression coefficients in
Table 6, in turn, indicate that Peg Tapping, KRISP Accuracy, and
HTKS made the strongest independent contributions to those
relations.

The results in Table 6 show, unsurprisingly, that a combination
of the six top performing individual CSR measures has
greater predictive relations with achievement and more concurrence
with teacher ratings than any single measure. Moreover,
in most instances the improvement in the magnitude of the
respective relations is great enough to indicate that a composite
of measures holds more promise as a general measure of CSR
for pre-K children than any one of them used alone. Among the
six measures, the strongest independent contributions were
made by Peg Tapping, KRISP Accuracy, and HTKS, which
would thus be the leading candidates for the most efficient
composite measure. In addition, Backwards Digit Span had an
especially strong influence in the relation between CSR gain
and achievement gain and thus would deserve some consideration
as well.

Table 6

Multiple Correlation and Regression Coefficients for the Top Six Cognitive Self-Regulation (CSR) Measures Together Predicting
Achievement and Concurring With Teacher Ratings in the Initial Sample (Top Panel) and Cross-Validation Sample (Bottom Panel)

                                                                  Standardized regression coefficients for CSR measures
                                                    Multiple      Copy                  KRISP      Peg                   Backwards
IVs and DV                                          correlation   Design     HTKS       accuracy   tapping    DCCS       Digit Span

Initial sample (N = 535)
IVs: T1 CSR measures; DV: T2 achievement            .72*          .10*       .17*       .19*       .24*       .17*       .19*
IVs: T1 CSR measures; DV: T3 achievement            .68*          .10*       .11*       .25*       .19*       .19*       .17*
IVs: T2 CSR measures; DV: T3 achievement            .71*          .07*       .25*       .17*       .19*       .12*       .23*
IVs: T1 CSR measures; DV: T1-T2 achievement gain    .29*          .08*       .05*       .05*       .04        .04†       .05†
IVs: T1 CSR measures; DV: T1-T3 achievement gain    .30*          .06*       .02        .14*       .03        .07*       .04
IVs: T2 CSR measures; DV: T2-T3 achievement gain    .25*          .02        .10*       .08*       .01        .01        .06*
IVs: T1-T2 CSR gain; DV: T1-T2 achievement gain     .38*          .04*       .06*       .06*       .08*       .07*       .10*
IVs: T1-T2 CSR gain; DV: T1-T3 achievement gain     .32*          .04        .11*       .05*       .04        .04        .12*
IVs: T1 CSR measures; DV: T1 teacher ratings        .56*          .16*       .09*       .15*       .25*       .08†       .02
IVs: T2 CSR measures; DV: T2 teacher ratings        .57*          .11*       .24*       .22*       .19*       .02        -.04

Cross-validation sample (N = 356)
IVs: T1 CSR measures; DV: T2 achievement            .73*          .02        .13*       .22*       .25*       .22*       .19*
IVs: T1 CSR measures; DV: T1-T2 achievement gain    .26*          -.01       .00        * OO O     .06†       .10*       .04
IVs: T1-T2 CSR gain; DV: T1-T2 achievement gain     .37*          .08*       .05†       .06*       .11*       .07*       .04
IVs: T1 CSR measures; DV: T1 teacher ratings        .60*          .19*       .10†       .19*       .21*       .08        .04
IVs: T2 CSR measures; DV: T2 teacher ratings        .62*          .30*       .16*       .18*       .11†       .10*       .03

Note. IV = independent variable; DV = dependent variable; DCCS = Dimensional Change Card Sort; HTKS = Head Toes Knees Shoulders; KRISP =
Kansas Reflection-Impulsivity Scale for Preschoolers. Academic achievement is the composite measure combining five Woodcock-Johnson subscales.
Time 1 (T1) = beginning of pre-K; Time 2 (T2) = end of pre-K; Time 3 (T3) = end of kindergarten.
† p < .10. * p < .05.

Cross-Validation

As described above, six of the candidate CSR child assessments
performed well in our comparative analyses. However, the large
number of analyses conducted to identify those six allows ample
opportunity for chance factors in the particular sample of children
and the data they provided to influence the results. In the follow-up
cross-validation study, therefore, we administered those six
measures¹ to a new sample to check the stability of the key features
that favored them in the initial analyses. We also used this new
sample to estimate the test-retest reliability of the selected measures.

The six selected CSR measures and the WJ-III achievement
measures used in the initial phase were administered in two sessions
at the beginning (Time 1) and end (Time 2) of the pre-K year.
The order of these sessions was varied, but the measures were
administered in a fixed order at each session. The CSR measures
were administered a second time approximately two and a half
weeks after the assessment sessions at the beginning of the year to
allow estimation of test-retest reliability. In addition, at Time 1,
Retest, and Time 2, teachers completed ratings on a subset of 20
teacher ratings items from the initial phase: 10 items from CFBR
Work-Related Skills, 3 from CBQ (2 Impulsivity, 1 Attention
Shifting), and 7 from TABC (3 Persistence, 4 Distractibility). The
20 selected items were those that had the largest loadings with the
common factor identified by a principal components analysis of
the original 57 items. Factor loadings greater than .70 for data
collected at both the beginning and end of pre-K indicated that
these 20 items efficiently represented the principal factor underlying
the original 57 items and they were therefore used in the
cross-validation to reduce the response burden on the teachers.
These items were all rated on 7-point scales and showed a high
level of internal consistency (Cronbach’s alpha of .98 at both Time
1 and 2). A total score was computed as the mean of the 20 items.

¹ Scores for the Backwards Digit Span measure in the cross-validation
reflect the longest span correctly recalled (range = 1-8) based on
administration procedures from the Wechsler Intelligence Scale for Children,
4th edition (Wechsler, 2003). For the KRISP, we added more advanced items
from version B to provide a better ceiling (maximum score of 48). Scoring
was altered for Copy Design; both of the two attempts allowed for each
item were scored, making the scores range from 0 to 16. The other cognitive
self-regulation measures were the same as before.

Test-retest reliability. The mean interval between the CSR
assessments for the 369 children in the test-retest sample was 16.7
days (SD = 5.0). Test-retest reliability was estimated using multilevel
regression to account for the effect of the nesting of children
within classrooms and classrooms within schools. For each measure,
the initial score was used to predict the retest score with the
standardized regression coefficients then representing test-retest
correlations. In descending order, those reliability coefficients and
their standard errors were Peg Tapping, .80 (.03); HTKS, .78 (.03);
Backwards Digit Span, .73 (.04); Copy Design, .72 (.04); KRISP
Accuracy, .64 (.04); and DCCS, .47 (.05). The KRISP reliability
coefficient is modest and that for DCCS is marginal, but the others
are in a generally acceptable range. Test-retest reliability was also
estimated for a composite of all six measures, yielding a reliability
coefficient of .89 (.02).
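
Why a standardized regression coefficient can stand in for a test-retest correlation is easiest to see in the single-level case. The sketch below is a simplification: it ignores the nesting of children within classrooms and schools that the multilevel models above account for, and the scores and function names are hypothetical. With one predictor and both variables z-scored, the least-squares slope equals the Pearson correlation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw scores to z-scores using the sample's own mean and SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def standardized_slope(test, retest):
    """Least-squares slope of z-scored retest on z-scored test."""
    zx, zy = standardize(test), standardize(retest)
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


def pearson_r(x, y):
    """Pearson correlation computed directly from deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


# Hypothetical initial and retest scores for five children
initial_scores = [2, 4, 5, 7, 9]
retest_scores = [3, 4, 6, 6, 10]

reliability = standardized_slope(initial_scores, retest_scores)
```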

Predictive relations with achievement. To assess the ability
of the CSR measures to predict academic achievement in the
cross-validation sample, we first examined the correlations between
the CSR measures at Time 1 and the composite achievement
score at Time 2 (column 4, Table 1). As with the initial sample,
these were estimated with standardized regression coefficients in
multilevel models. These coefficients were statistically significant
and very similar to those found in the initial sample (column 1,
Table 1). The correlation with achievement for a composite score
that combined all six CSR measures is also shown in Table 1 and
demonstrates that the combined set of items performs notably
better than any one item.

The ability of each CSR measure to predict the gain children in
the cross-validation sample made in achievement over the pre-K
year was also assessed with standardized regression coefficients
from multilevel models in which Time 1 achievement was controlled.
These coefficients were statistically significant for DCCS,
HTKS, KRISP Accuracy, and Peg Tapping (column 6, Table 2),
but showed some modest inconsistencies with the initial sample
results (column 1, Table 2) for Copy Design (.05 vs. .12), DCCS
(.11 vs. .07), and HTKS (.06 vs. .11). The coefficients for the more
revealing relations between gains on the CSR measures and gains
in achievement over the pre-K year (Time 1 to Time 2) were
statistically significant for all the measures (column 7 in Table 2).
The strongest relations were for Peg Tapping, DCCS, Copy Design,
and KRISP Accuracy but here also there were some modest
inconsistencies with the estimates from the original sample (column 5,
Table 2) for some measures, specifically Backwards Digit
Span (.06 vs. .12), Copy Design (.10 vs. .07), and Peg Tapping (.15
vs. .11). The predictive coefficients for the composite score that
combined all six CSR measures are also shown in Table 2 and here
also the combined set of items performs better than any one item.

The final series of analyses for the predictive relations with
achievement in the cross-validation sample investigated the independent
contribution of each of the six CSR measures relative to
the others when they were used simultaneously as independent
variables in multilevel regressions. These analyses were conducted
using the same models and procedures described earlier for the
analogous analyses with the initial data. The results are reported in
Table 6 (bottom panel) along with those for the initial sample (top
panel). Across all outcomes, Peg Tapping, DCCS, and KRISP
Accuracy showed the largest independent relations to achievement
and these were the only three CSR measures for which the coefficients
were statistically significant with every outcome. However,
Copy Design showed a significant independent gain-with-gain
relation and HTKS and Backwards Digit Span showed
significant independent contributions to predicting Time 2
Achievement. Comparing these results with the analogous ones for
the initial sample, the coefficients most similar on statistical significance
and magnitude across the achievement outcomes were
for KRISP Accuracy, though Peg Tapping, DCCS, and HTKS also
showed relatively good replication.

Concurrent relations with teacher ratings. The total score
of the 20 teacher rating items was used as the dependent variable
in multilevel regression models with each CSR measure in turn as
the sole independent variable, as in the analogous analysis with the
initial sample. The standardized regression coefficients that represent
the correlations between each CSR measure and the teacher
rating total score are reported in the columns on the far right in
Table 4. They ranged from .27 to .47 and all were statistically
significant. The largest correlations were found for Copy Design,
Peg Tapping, HTKS, and KRISP Accuracy. Compared with the
analogous values from the initial sample (also shown in Table 4),
all but two of these correlations are larger and those two are close
to the prior values. A broader view is provided by the correlations
between the total teacher rating scores at Times 1 and 2 and the
composite score that combined all six CSR measures. These were
.54 and .60, respectively (not shown in Table 4), again showing
stronger relations than any of the individual measures in that
composite.

Useful  Information 

As shown in analyses with both the initial and cross-validation
samples, none of the individual top performing CSR measures did
nearly as well in our tests as a composite of all six of them.
Moreover, each of the six measures made an independent contribution
to the predictive strength of the composite, so no more
efficient subset of fewer than all six measures would perform quite
as well. Our procedure for administering those measures with the
cross-validation sample demonstrated that it was feasible to include
them in a single assessment session of 35-45 min. Further
information about these six measures and how they are administered
can be obtained from the corresponding author. It might be
tempting to shorten the battery by omitting Copy Design and
Backwards Digit Span, but analyses not reported here across both
samples showed that this produced a notable decrement in the
performance of the composite for predicting achievement. The six
measures are scored on quite different scales, however, complicating
the integration of them into a single composite measure. For
research purposes, computing standardized z-scores for each, then
summing them provides a straightforward way to create such a
composite measure. Such standardization, however, makes the
scoring dependent on the means and standard deviation of the
particular sample on which the data were collected. Those values
may not be well estimated in small samples and, in any event, such
sample dependence undermines comparability across samples and
studies. For more general use, each measure can be rescaled into a
0- to 5-point format with all six then summed to create a simple
additive total score that works well. Appendix B describes the
rationale, procedure, and results of this rescaling.
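
The z-score composite mentioned above for research purposes can be sketched as follows. The scores, the three-measure subset, and the function names are hypothetical and serve only to show the mechanics; the 0- to 5-point rescaling is not sketched here because its procedure is given in Appendix B rather than in this section.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize one measure using this sample's mean and SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def z_composite(scores_by_measure):
    """Sum each child's z-scores across measures.

    scores_by_measure maps a measure name to a list of raw scores,
    with children in the same order in every list.
    """
    standardized = [z_scores(raw) for raw in scores_by_measure.values()]
    return [sum(child) for child in zip(*standardized)]


# Hypothetical raw scores for four children on three differently scaled measures
raw_scores = {
    "Peg Tapping": [4, 8, 10, 14],
    "HTKS": [5, 10, 20, 25],
    "Copy Design": [1, 2, 2, 4],
}

composite = z_composite(raw_scores)
```

Because the means and standard deviations come from the sample itself, adding or removing children changes every child's composite score, which is the sample dependence the text warns about.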

Use of any of the CSR measures identified in this study as
outcome variables in research on the effects of interventions with
pre-K students will likely involve cluster-randomized trials with
students nested within classrooms and schools. The intraclass
correlations (ICCs; also known as intracluster correlations) that
characterize the proportions of total variance that are between
schools and between classrooms within schools are critical for
estimating statistical power during the planning stage and influence
the standard errors in multilevel analysis. The multilevel
structure of the samples used in the present study allows ICCs to
be estimated for classroom and school clusters. These estimates are
reported in Table 7 for the initial sample, the larger of the two
available, at the beginning of pre-K. They were estimated in
three-level unconditional models with each of the CSR measures
in turn as the dependent variable. The respective ICC values were
computed as the proportion of the total variance associated with
each of the levels in the multilevel structure. The between-school
and between-classroom-within-school ICCs were relatively modest
for this pre-K sample with the between-classroom value virtually
zero for several measures.
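
The ICC computation described above, each level's share of the total variance, can be sketched as follows. The variance components are hypothetical: they were chosen so that the resulting proportions match the CSR Composite Score row of Table 7, and the estimates actually produced by the three-level models are not reported in this section. The function name is likewise an illustration.

```python
def intraclass_correlations(var_school: float,
                            var_classroom: float,
                            var_student: float) -> dict:
    """Proportion of total variance at each level of a three-level model."""
    total = var_school + var_classroom + var_student
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {
        "between_schools": var_school / total,
        "between_classrooms_within_schools": var_classroom / total,
        "between_students_within_classrooms": var_student / total,
    }


# Hypothetical variance components from an unconditional three-level model
iccs = intraclass_correlations(var_school=0.27,
                               var_classroom=0.20,
                               var_student=9.53)
rounded_iccs = {level: round(value, 3) for level, value in iccs.items()}
```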

Discussion 

The objective of this study was to identify direct assessment
measures of CSR for pre-K children that are closely linked to their
academic achievement, that is, learning-related cognitive self-regulation
(LRCSR), and that perform well on other criteria that
make them educationally relevant for research and practical applications.
In pursuing this objective, we evaluated existing measures
in a comparative fashion, choosing candidate measures with attention
to the aspect of CSR most salient in the tasks the measure
presented to children, prior evidence about their association with
academic achievement, and the ease with which they could be
administered in classroom settings. The most important consideration
for identifying the best performing of the selected measures
was their ability to predict children’s subsequent academic
achievement and achievement gains. Because an important use of
such measures is as outcomes for research on interventions aimed
at improving LRCSR, we also considered the extent to which the
measures showed change over the pre-K year, thus demonstrating
their ability to respond to increases in children’s LRCSR skills.
Finally, we attended to the concurrent relations between the candidate
measures and ratings by pre-K teachers of children’s
LRCSR-related behavior in the classroom as a further indication of
their educational relevance.

Table 7

Intraclass Correlation Coefficients Associated With Students
Nested Within Classrooms and Classrooms Nested Within
Schools in the Initial Sample

                        Between    Between classrooms   Between students
CSR measure             schools    within schools       within classrooms
Backwards Digit Span    .020       .000                 .980
Copy Design             .049       .017                 .934
DCCS                    .028       .038                 .934
HTKS                    .006       .036                 .958
KRISP Accuracy          .013       .000                 .987
Peg Tapping             .035       .000                 .965
CSR Composite Score     .027       .020                 .953

Note. N = 535. CSR = cognitive self-regulation; DCCS = Dimensional
Change Card Sort; HTKS = Head Toes Knees Shoulders; KRISP =
Kansas Reflection-Impulsivity Scale for Preschoolers. The CSR Composite
Score combines the six individual CSR measures shown.


Of  the  12  candidate  measures  evaluated,  analyses  with  the  initial 
pre-K  sample  identified  six  that  performed  especially  well  against 
these  criteria.  Cross-validation  with  a  new  sample  confirmed  the 
stability  of  those  findings.  The  best  performing  measures  overall 
were  Copy  Design,  HTKS,  KRISP  Accuracy,  Peg  Tapping,  DCCS, 
and  Backwards  Digit  Span.  The  single  best  performing  measure 
across  all  our  analyses  was  Peg  Tapping,  with  the  functionally 
similar  HTKS  close  behind  and  KRISP  Accuracy  in  third  place. 

The  findings  reported  here  complement  and  extend  the  body  of 
research  on  the  psychometric  characteristics  of  CSR  measures  for 
pre-K  age  children.  While  same  day  test-retest  reliability  for 
DCCS  has  been  documented  (Beck,  Schaefer,  Pang,  &  Carlson, 
2011)  and  normed  performance  standards  have  been  established 
(Weintraub  et  al.,  2013),  the  present  study  adds  to  the  validation  of 
this  measure  for  use  in  preschool  settings  to  assess  learning-related 
CSR  by  demonstrating  its  relation  with  concurrent  teacher  ratings 
of  CSR  in  situ  and  with  later  academic  achievement.  Also,  with 
regard  to  research  with  the  HTKS  task,  the  present  work  adds  to 
the somewhat mixed results from attempts to demonstrate associations
between teacher ratings of classroom behavioral regulation
and academic achievement (Graziano et al., 2015; McClelland et
al., 2007; Ponitz et al., 2009) by demonstrating test-retest reliability
in both CSR tasks and teacher ratings. Similarly, this study
contributes  test-retest  and  internal  consistency  reliability  estimates 
and  supportive  validity  data  for  preschool  applications  for  all  six  of 
the  measures  that  performed  best  in  our  comparative  evaluation. 

Although  sound  psychometric  characteristics  are  fundamental 
for  any  CSR  measure  that  will  be  used  for  practical  or  research 
purposes,  the  unique  contribution  of  this  study  is  the  head-to-head 
comparison  of  the  selected  measures  on  a  range  of  probing  per¬ 
formance  indicators  related  to  their  educational  relevance  for 
pre-K  children  in  classroom  settings.  The  results  provide  a  firm 
empirical  basis  for  the  use  of  any  of  the  top  performing  measures 
for  either  of  the  applications  that  motivated  this  study.  The  better 
performing  measures  showed  close  relations  to  achievement  and 
achievement  gains,  sensitivity  to  developmental  change,  and  rea¬ 
sonable  congruence  with  the  CSR-related  behavior  teachers  ob¬ 
served  in  the  classroom.  These  characteristics  make  them  espe¬ 
cially  suitable  as  screening  measures  to  identify  children  whose 
CSR  skills  may  be  low  enough  to  impair  their  learning  in  pre-K 
contexts  and  to  monitor  improvement  in  those  skills  during  the 
pre-K  year.  Further,  those  characteristics  make  the  top  performing 
measures  suitable  choices  as  outcomes  for  intervention  research 
aimed  at  improving  those  CSR  skills  that  have  sufficiently  close 
relations  to  learning  that  such  improvement  may,  in  turn,  boost 
academic  achievement.  It  is  especially  fortuitous  in  this  regard  that 
the  best  performing  individual  CSR  measures  are  among  the 
easiest to administer and score. Most notably, Peg Tapping and
HTKS performed very well by the criteria we applied, and both can
be administered quickly and easily without special equipment or
materials and without extensive training. Each could thus be used
on  a  stand-alone  basis  with  the  results  reported  here  providing 
assurance  of  their  educational  relevance  for  pre-K  students. 

Similar  to  Willoughby  et  al.  (2016),  however,  we  found  that  the 
best  performing  CSR  measures  work  better  as  a  composite.  It  is 
hardly  surprising  that  a  combination  of  related  measures  performs 
better  than  any  individual  measure  by  itself.  What  was  somewhat 
unexpected  was  that  each  of  the  six  made  significant  independent 
contributions  to  at  least  some  of  the  relations  examined  with  them 
in combination. The best composite measure based on these results,
therefore, would include all six individual measures. However,
there were differences in the independent contribution each
measure  made  to  the  performance  of  that  composite,  with  Peg 
Tapping,  KRISP  Accuracy,  and  HTKS  demonstrating  the  strongest 
independent  relations,  and  gain  on  Backwards  Digit  Span  showing 
an  especially  strong  independent  relation  to  achievement  gain.  A 
more efficient composite measure incorporating the three top performers
in this analysis, possibly with Backwards Digit Span included
as a fourth, therefore, would also perform better than any
single  measure  while  not  requiring  data  collection  on  all  six. 

It  is  notable  that  the  six  measures  contributing  to  the  full  composite 
measure  represented  a  mix  of  CSR  skills — two  primarily  emphasizing 
sustained  attention  (Copy  Design,  KRISP),  two  emphasizing  inhibi¬ 
tory  control  (Peg  Tapping,  HTKS),  one  emphasizing  attention  shifting 
(DCCS), and one emphasizing working memory (Backwards Digit
Span). Although the skills in this mix were interrelated and all were
found  to  be  relevant  to  learning,  no  one  skill  dominated  so  strongly 
that  the  others  were  irrelevant.  This  grouping  of  tasks  as  indicators  of 
CSR  aligns  with  prior  work  supporting  a  multidimensional  view  of 
executive  function  (Miyake  et  al.,  2000). 

Recognition  of  the  advantages  of  an  ensemble  of  measures  to 
fully  represent  various  forms  of  CSR  is  not  unique  to  the  present 
study.  The  work  on  the  NIH  Toolbox  of  brief  measures  for 
executive  function  (Weintraub  et  al.,  2013;  Zelazo  &  Bauer,  2013) 
and  the  program  of  research  by  Willoughby  and  colleagues  (Wil¬ 
loughby,  Blair,  Wirth,  &  Greenberg,  2012;  Willoughby,  Wirth,  & 
Blair, 2011) also take this approach. Moreover, Willoughby et al.
(2016)  identified  a  cluster  of  measures  of  executive  function  that 
are  associated  with  academic  achievement.  Similarly,  the  Chicago 
School  Readiness  Project  has  developed  a  brief  direct  assessment 
battery  of  CSR  measures  appropriate  for  field-based  settings 
(Smith-Donald,  Raver,  Hayes,  &  Richardson,  2007).  The  measures 
in  their  battery  have  shown  relations  to  classroom  learning  behav¬ 
iors  (Denham,  Warren-Khot,  Bassett,  Wyatt,  &  Perna,  2012)  and 
concurrent  and  future  academic  achievement  (Brock,  Rimm- 
Kaufman,  Nathanson,  &  Grimm,  2009)  in  young  children. 

The  present  work  extends  these  efforts  in  several  ways.  First,  it 
introduces  a  particular  mix  of  easy  to  administer  LRCSR  measures 
especially  suitable  for  use  in  early  childhood  education  settings. 
Additionally,  it  broadens  the  scope  of  the  data  supporting  the 
relations  of  these  LRCSR  measures  to  learning  in  those  settings. 
As  with  the  measures  in  the  Chicago  School  Readiness  Project 
battery,  the  measures  we  have  identified  are  related  to  academic 
achievement  and  teacher  reported  LRCSR,  relations  that  have  not 
been  demonstrated  for  many  CSR  measures  appropriate  for  use 
with  pre-K  age  children.  In  addition,  however,  this  study  has 
further  explored  relations  with  achievement  gains,  assessed  devel¬ 
opmental  change  over  time,  and  demonstrated  both  test-retest  and 
internal  consistency  reliability  for  the  better  performing  measures 
that  emerged  from  our  comparative  evaluation. 

There  are,  of  course,  limitations  to  the  research  presented  here. 
For  practical  reasons,  it  was  not  possible  to  collect  data  and 
conduct  comparative  analyses  for  all  the  CSR  measures  that  have 
been  used  with  pre-K  age  children.  The  selections  we  made  may 
have  omitted  some  measures  that  would  have  performed  as  well  or 
better  than  those  chosen.  In  particular,  we  would  expect  some  of 
the  computer-based  measures  to  perform  well  on  our  criteria. 
Nonetheless, we excluded them to focus on measures that did not
require computer support or Internet connections, which we believe
makes them more accessible and easily used in pre-K classroom
settings, especially for potential use by teachers.

We  also  acknowledge  the  uncertain  generalizability  of  the  findings 
based  on  the  sample  of  pre-K  children  that  provided  the  data  for  this 
study.  Though  the  initial  sample  included  more  than  500  children 
drawn  from  a  relatively  large  number  of  schools  and  community 
childcare  centers,  it  was  of  necessity  limited  to  children  whose  parents 
consented  to  their  participation.  We  have  no  data  for  the  40%  of  the 
children  in  those  classrooms  whose  parents  did  not  return  consent 
forms  (only  a  very  few  actively  declined  to  consent)  and  thus  have  no 
basis  for  determining  if  they  were  systematically  different  from  those 
whose parents consented, in ways that might have affected our findings. And, though
the  sample  was  diverse  with  regard  to  gender,  race,  and  economic 
status,  it  was  fundamentally  a  convenience  sample,  not  a  probability 
sample  of  a  defined  population  of  pre-K  children.  Because  of  the  span 
of  schools,  childcare  centers,  and  classrooms,  we  have  some  confi¬ 
dence  that  this  sample  represented  fairly  typical  pre-K  age  children  in 
the  middle  Tennessee  region,  but  no  assurance  that  similarly  con¬ 
structed  samples  in  other  parts  of  the  country  would  have  produced 
comparable  findings. 

We  also  must  emphasize  that  the  fact  that  the  LRCSR  measures 
identified  in  this  study  are  predictive  of  later  achievement  and 
achievement  gains  does  not  mean  that  they  represent  causal  factors  for 
those  outcomes.  Our  purpose  in  this  study  was  not  to  attempt  to 
establish  causal  relations  but,  among  other  objectives,  to  identify 
measures  that  might  be  especially  appropriate  as  outcome  variables  in 
research  that  does  investigate  causal  influences.  With  conceptually 
relevant  and  responsive  measures  in  hand,  a  key  question  for  future 
research  is  what  practical  interventions  or  teacher  practices  are  capa¬ 
ble  of  increasing  pre-K  children’s  LRCSR  skills.  There  is  some 
evidence  using  one  or  another  of  the  measures  identified  here  that 
such  effects  are  possible  (e.g.,  Bierman  et  al.,  2008;  Fuhs,  Farran,  & 
Nesbitt,  2013;  Raver  et  al.,  2011),  but  also  some  less  encouraging 
findings  (e.g.,  Barnett  et  al.,  2008). 

Assuming  that  LRCSR  can  be  boosted,  an  even  more  important 
question  is  whether  doing  so  for  pre-K  children  will,  in  turn,  lead 
to  greater  learning  and  increased  academic  achievement.  With 
regard  to  that  question,  we  believe  the  concurrence  between  the 
LRCSR  measures  identified  in  this  study  and  teacher  ratings  of 
LRCSR-related  behaviors  in  the  classroom  is  especially  important. 
A  very  plausible  theory  of  action  for  the  potential  effects  on 
academic  achievement  of  interventions  that  increase  LRCSR  is 
that  they  are  mediated  by  the  kinds  of  LRCSR-related  behaviors 
teachers  observe  in  the  classrooms,  for  example,  engagement  in 
learning  activities,  persistence  in  completing  tasks,  attentiveness  to 
teachers’  instructions,  and  the  like.  Testing  such  causal  and  medi- 
ational  relations  is  best  done  via  randomized  experiments  and  goes 
beyond  the  scope  of  the  present  study  but  is  a  promising  area  for 
further  research  aimed  at  improving  the  effectiveness  of  pre-K 
instruction.  Based  on  the  evidence  developed  in  the  present  study, 
the  well-performing  LRCSR  measures  we  have  identified  should 
be  quite  appropriate  for  supporting  such  research. 
Appendix A

Candidate Direct Child Assessment Tasks Categorized by the Most Salient CSR Component Skill

Instrument
    Task


Sustained attention: attending to and sustaining focus on a task

1. Copying tasks

Bender Gestalt Test (Bender, 1938; Dibner & Korn, 1969; Koppitz, 1973)
    Children copy nine simple geometric designs exactly.

Copy Design (Davie et al., 1972; Osborn et al., 1984)
    Children copy eight simple geometric designs exactly.

2. Matching tasks

Matching Familiar Figures Test (Kagan et al., 1964)
    Children select a picture that matches a target picture; accuracy is a measure of attention,
    over-fast reaction times assess impulsivity.

Kansas Reflection-Impulsivity Scale for Preschoolers (Wright, 1971)
    Children select a picture that matches a target picture; accuracy is a measure of attention,
    over-fast reaction times assess impulsivity.

Tower of London Task (Kirkorian et al., 1994; Ward et al., 2005)
    Children build a tower to match a picture using blocks and a peg for stacking.

3. Stimulus-response tasks

Continuous Performance Test (Beck et al., 1956)
    Children press a button when a computer-generated target stimulus appears and inhibit
    responses to non-target stimuli. Correct responses measure sustained attention;
    incorrect responses measure impulsivity.

Self-Regulation Test for Children (Howse et al., 2003; Kuhl & Kraska, 1993)
    Children press the button that matches the target on a screen; can involve distractors to
    make the task more difficult.


Attention shifting: shifting focus within or between tasks as situations demand

Something’s the Same/Item Selection (Blair & Willoughby, 2006a)
    Children categorize colored pictures first by the object, then switch to sort by color.

Dimensional Change Card Sort (Diamond et al., 2005; Zelazo, 2006) & variants (Wisconsin Card
Sorting Task)
    Children sort a set of cards by shape, then switch to sort by color. A more difficult
    version adds a variable cue that indicates the sorting rule.

Flexible Item Selection Task (Jacques & Zelazo, 2001)
    Children select a pair of cards that match on one dimension (e.g., shape, color), then
    must select a different pair that matches on another dimension.


Working memory: active maintenance and manipulation of information in memory

Operation Span (Blair & Willoughby, 2006b)
    Children must recall a series of objects shown inside a picture of a house. A color
    distractor adds difficulty.

Self-Ordered Pointing (Petrides & Milner, 1982)
    Children are presented a set of pages divided into four sections; they must go through
    and point to a different remembered picture on each page.

Backward Digit Span (Davis & Pratt, 1996; Pickering & Gathercole, 2001)
    Children recall a series of orally presented digits backwards.

Corsi Block Tapping (Berch et al., 1998)
    Children reproduce the sequence in which the assessor taps a series of blocks with
    sequences of increasing length.


Inhibitory control: volitional inhibition of a prepotent response to complete a task

1. Stroop-like tasks

Silly Sounds Game (Blair & Willoughby, 2006d)
    Children meow to pictures of dogs and bark to pictures of cats.

Day/Night (Carlson & Moses, 2001; Gerstadt et al., 1994)
    Children say “night” to sun pictures and “day” to moon pictures.

Grass/Snow (Carlson & Moses, 2001)
    Children say “green” to snow pictures, and “white” to grass pictures.

2. Stroop-like tasks with motor response

Bear & Dragon and variants (Jones et al., 2003; Reed et al., 1984)
    A bear puppet and dragon puppet give children tasks (touching feet, hopping, etc.); then,
    they are asked to only perform tasks given by the bear puppet, not those given by the
    dragon puppet.

Simon Says (Carlson, 2005; Strommen, 1973)
    Children perform certain tasks (touching their feet, hopping, etc.) only when the assessor
    precedes the command with “Simon Says.”

Head-to-Toes; Head-Toes-Knees-Shoulders (Ponitz et al., 2009; McClelland et al., 2007)
    Children do the opposite of what the assessor requests; e.g., if asked to touch their head,
    they touch their toes.

Luria Hand Game (Hughes, 1998; Luria et al., 1964)
    Similar to the Head-to-Toes task, children do the opposite of what the assessor indicates
    using hand signals (e.g., holding up a fist vs. one finger).

Peg- or Finger-Tapping (Diamond & Taylor, 1996; Diamond et al., 1997; Smith-Donald et al., 2007)
    Children tap a peg, pencil, or finger twice when the experimenter taps once, and vice
    versa.

Pig Game (Blair & Willoughby, 2006c)
    In a series of animal pictures, children press a button when they see animals that aren’t
    pigs, and don’t press the button when they do see pigs.

3. Spatial Conflict/Simon Tasks

Spatial conflict (Blair & Willoughby, 2006e)
    Target pictures are presented on the left side of a paper and children point to the target
    pictures with their right hand, and vice versa.

Spatial conflict (Gerardi-Caulton, 2000)
    Computerized version of the task above; children push a key on one side of the keyboard
    for target on the opposite side of the computer screen.

4. Flanker Tasks

Attention Network Task/Flanker Task (Ponitz et al., 2009; Rueda et al., 2004)
    Children indicate the direction of a target flanked by same/opposite direction distractors;
    e.g., feed the central fish by pressing a button corresponding to the direction which the
    middle fish is swimming when flanked by fish swimming the same or opposite direction.


Effortful control: suppression of impulsive or premature responses when required by a task

Snack delay; gift delay (Kochanska et al., 2000)
    Variety of delay tasks in which children must wait before eating a cookie, opening a gift,
    etc.

Tower; Turn-taking (Kochanska et al., 1996)
    Assessor and child take turns placing blocks on a tower; children must wait their turn
    without reminders.

Whisper (Kochanska et al., 1996)
    Children see pictures of familiar cartoon characters and whisper their names; number
    whispered vs. shouted or said in normal voice is scored.

Walk-a-Line Slowly/Draw-a-Line Slowly (Maccoby et al., 1965)
    Children walk or draw a line at normal speed, then do the same thing slowly; time
    difference between regular and slow trials is scored.

Turtle and Rabbit (Kochanska et al., 1996)
    Children are given a “fast” rabbit and a “slow” turtle toy and move them along a path; scored
    for accuracy in negotiating the path and the time difference between fast and slow trials.


Appendix B

Scoring Scheme for the Child Measures of Learning-Related Cognitive Self-Regulation

            Scores in the original metric for each measure

Rescaled    Peg                                  Copy      Backwards
score       Tapping    HTKS     KRISP    DCCS    Design    Digit Span

0           <5         <7       <25      0       0         0
1           6-7        8-15     26-29    1       1         1
2           8-9        16-23    30-32    1       2         2
3           10-12      24-31    33-36    2       3         3
4           13-14      32-38    37-39    2       4-5       4
5           >14        >38      >39      3       >5        >5

Note. HTKS = Head-Toes-Knees-Shoulders; KRISP = Kansas Reflection-Impulsivity Scale for Preschoolers; DCCS =
Dimensional Change Card Sort. The six learning-related cognitive self-regulation (LRCSR) measures identified in this paper
were scored on different scales (e.g., 0-3 for DCCS, 0-52 for HTKS), complicating the construction of a total score for all
six measures together. One solution is to rescale the scores on each measure to a common scale, then sum them for a total
score. We found that a 0-5 point scale format worked well for this purpose. To determine which original scores should be
rescored into each value on this common scale, we took advantage of the linear relation between children’s age and their
scores on each measure. Using data from the initial sample, we regressed the scores for each measure on age and used the
results to estimate the scores in the original metric expected at ages 4.0, 4.5, 5.0, 5.5, and 6.0, spanning the pre-K age range.
These estimates were then used as break points for rescaling each original score into the 0-5 format. The resulting procedure
is shown above. In the initial sample with which this scheme was constructed, correlations between rescaled scores and those
in the original metric ranged from .92 to 1.00 across measures and the Time 1 (beginning of pre-K) and Time 2 (end of
pre-K) measurement waves. The rescaled scores also performed well for the Time 3 end of kindergarten measures, with
correlations from .82 to .98. When applied to the Time 1 and 2 data from the cross-validation sample, the correlations ranged
from .91 to 1.00. The total scores produced by summing the rescaled scores across all six items showed correlations from .94
to .99 with the factor scores for Time 1, 2, and 3 in the initial sample, and correlations from .97 to .99 with the Time 1 and 2
factor scores in the cross-validation sample.
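The rescaling procedure described in the Note can be sketched roughly as follows: regress a measure's scores on age, predict the score expected at each of the five ages, and use those predictions as break points for a 0-5 scale. This is a minimal sketch under stated assumptions, not the authors' code. The function names, the toy data, and the handling of a score that falls exactly on a break point (placed in the higher category here) are all hypothetical, and the break points it produces are not those printed in the table.

```python
from bisect import bisect_right

# Ages named in the Note; the predicted score at each age is a break point.
TARGET_AGES = [4.0, 4.5, 5.0, 5.5, 6.0]


def fit_line(ages, scores):
    """Ordinary least squares for score = intercept + slope * age."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_score = sum(scores) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (s - mean_score) for a, s in zip(ages, scores))
    slope = sxy / sxx
    return mean_score - slope * mean_age, slope


def break_points(ages, scores):
    """Predicted raw scores at the target ages, in increasing age order."""
    intercept, slope = fit_line(ages, scores)
    return [intercept + slope * age for age in TARGET_AGES]


def rescale(score, cuts):
    """Map a raw score to 0-5 by counting the break points it reaches.

    Assumes scores rise with age, so the cuts are in ascending order.
    """
    return bisect_right(cuts, score)


def total_score(raw_scores, cuts_by_measure):
    """Sum the rescaled scores across measures for one child."""
    return sum(rescale(raw_scores[m], cuts_by_measure[m]) for m in raw_scores)


# Hypothetical data for a single made-up measure, not from the study.
ages = [4.0, 5.0, 6.0]
scores = [4.0, 8.0, 12.0]
cuts = break_points(ages, scores)
print(cuts)
print([rescale(s, cuts) for s in (3, 9, 20)])
```

Because every measure ends up on the same 0-5 scale, the total across six measures runs from 0 to 30 regardless of each measure's original range.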
Research examining effective reading interventions for students with reading difficulties in the upper
elementary grades is limited relative to the information available for the early elementary grades. In
the  current  study,  we  examined  the  effects  of  a  multicomponent  reading  intervention  for  students 
with  reading  comprehension  difficulties.  We  used  a  partially  nested  analysis  with  latent  variables  to 
adequately  match  the  design  of  the  study  and  provide  the  necessary  precision  of  intervention  effects. 
We  examined  the  effects  of  the  intervention  on  students’  latent  word  reading,  latent  vocabulary,  and 
latent  reading  comprehension.  In  addition,  we  examined  whether  these  effects  differed  for  students 
of  varying  levels  of  reading  or  English  language  proficiency.  Findings  indicated  the  treatment 
significantly  outperformed  the  comparison  on  reading  comprehension  (Effect  Size  =  0.38),  but  no 
overall  group  differences  were  noted  on  word  reading  or  vocabulary.  Students’  initial  word  reading 
scores  moderated  this  effect.  Reading  comprehension  effects  were  similar  for  English  learner  and 
non-English  learner  students. 


Educational  Impact  and  Implications  Statement 

This  study  examined  the  effects  of  a  multi-component  reading  intervention  for  students  with  reading 
difficulties  in  fourth  grade.  Findings  indicated  students  receiving  the  intervention  made  greater  gains 
in  reading  comprehension  than  students  who  did  not  receive  the  intervention.  This  finding  was 
similar  for  students  who  were  English  learners  or  non-English  learners.  However,  students  with 
higher  initial  word  reading  scores  benefited  more  from  the  intervention.  These  findings  suggest 
students  receiving  the  intervention  made  progress  in  closing  the  gap  between  their  current  level  of 
performance  and  expected  levels  of  performance  in  reading  comprehension. 


Keywords:  reading  intervention,  reading  difficulties,  elementary 


Students with reading difficulties can benefit from supplemental
reading instruction provided in small groups; reading interventions
at the elementary level have demonstrated power for preventing
and remediating many reading difficulties (Blachman et al., 2004;
Mathes et al., 2005; O’Connor, Fulmer, Harty, & Bell, 2005;
Torgesen et al., 1999; Vellutino et al., 1996). However, research
examining effective reading interventions for students with reading
difficulties in the upper elementary grades is limited relative to
the information available for the early elementary grades (Wanzek,
Wexler, Vaughn, & Ciullo, 2010). Effective reading interventions
for students with reading difficulties in the upper elementary
grades are essential given the large numbers of students
who continue to struggle with reading at these grade levels
(National Center for Educational Statistics, 2016).

Reading  Interventions  for  Upper  Elementary  Students 

The  research  available  on  reading  interventions  related  to  upper 
elementary  students  with  reading  difficulties  demonstrates  positive 
effects  for  interventions  providing  instruction  in  comprehension  or 
word  recognition  (Wanzek  et  al.,  2010).  Higher  effects  were  noted 
for  interventions  related  specifically  to  comprehension  instruction. 
For  example,  large  mean  effects  across  comprehension  measures 
were  noted  in  two  experimental  studies  of  comprehension  strategy 
instruction  for  students  with  reading  difficulties  (Mason,  2004; 
Miranda  et  al.,  1997).  However,  the  upper  elementary  research, 
including  these  comprehension  interventions,  has  also  largely  ex¬ 
amined  intervention  effects  on  proximal,  researcher-developed 
measures.  In  fact,  15  of  the  24  studies  synthesized  by  Wanzek  et 
al.  (2010)  used  only  researcher-developed  measures.  Researcher- 
developed  measures  often  result  in  higher  effects  than  standardized 
measures  of  the  same  constructs  (Scammacca  et  al.,  2007;  Swan¬ 
son,  Hoskyn,  &  Lee,  1999).  Thus,  the  lack  of  information  on  the 
effects  of  providing  comprehension  interventions  on  standardized 
measures  represents  a  gap  in  the  knowledge  base  on  upper  ele¬ 
mentary  reading  interventions. 

Additionally, Wanzek et al. (2010) reported that most of the
research thus far on upper elementary reading interventions for
students with reading difficulties has been conducted with relatively
brief interventions (e.g., 15-min sessions; less than 6 weeks)
that  examined  single  instructional  strategies  (e.g.,  main  idea  strategy). 
These  studies  provide  important  information  regarding  effective  prac¬ 
tices  that  could  be  incorporated  in  reading  interventions  to  accelerate 
student  learning.  Knowledge  of  student  outcomes  when  effective 
practices  for  various  reading  components  are  put  together  to  form 
more  comprehensive  interventions  for  struggling  readers  is  also 
needed. 

In  fact,  some  of  the  highest  effects  in  the  upper  elementary 
reading  intervention  literature  have  come  from  multicomponent 
interventions  (Wanzek  et  al.,  2010).  Though  there  are  only  a  few 
of  these  studies  in  the  literature  (e.g.,  O’Connor  et  al.,  2002; 
Ritchey,  Silverman,  Montanaro,  Speece,  &  Schatschneider,  2012; 
Therrien,  Wickstrom,  &  Jones,  2006;  Vadasy  &  Sanders,  2008; 
Wanzek  &  Roberts,  2012),  the  findings  suggest  the  possible  im¬ 
portance  of  addressing  multiple  reading  components  in  reading 
intervention  for  these  older  students.  Three  of  these  studies  dem¬ 
onstrated  moderate  to  large,  significant  effects  on  norm-referenced 
measures  of  comprehension  or  broad  reading  achievement 
(O’Connor  et  al.,  2002;  Therrien  et  al.,  2006;  Vadasy  &  Sanders, 
2008).  The  effect  sizes  ranged  from  0.37  to  1.87.  The  interventions 
in  these  studies  included  instruction  in  reading  comprehension 
along  with  additional  instruction  in  word  reading  (O’Connor  et  al., 
2002),  fluency  (O’Connor  et  al.,  2002;  Therrien  et  al.,  2006; 
Vadasy  &  Sanders,  2008),  and/or  vocabulary  (Vadasy  &  Sanders, 
2008).  The  findings  suggest  students  with  reading  difficulties  at 
the  upper  elementary  level  may  benefit  most  when  interventions 
focus on multiple elements of reading, providing opportunities for
students to integrate reading practices to read and understand text.
In  an  earlier  synthesis  of  interventions  for  students  with  learning 
disabilities,  Swanson  et  al.  (1999)  reported  the  highest  effects  for 
interventions  that  combine  direct  instruction  of  content  with  strat¬ 
egy  instruction.  Most  of  the  multiple  component  reading  interven¬ 
tions  conducted  at  the  upper  elementary  level  have  incorporated 
both  types  of  instruction.  Several  other  syntheses  for  older  students 
confirm  the  value  of  multicomponent  interventions  (Kamil  et  al., 
2008;  Scammacca  et  al.,  2007;  Torgesen  et  al.,  2007). 

The  previous  research  also  suggests  some  differential  effects  for 
English  learners  (ELs)  with  reading  difficulties  relative  to  their 
non-EL peers (Kieffer, 2008). In particular, ELs are at a markedly
greater risk of late-emerging (after Grade 3) reading difficulties
(Kieffer, 2010, 2014), suggesting that foundational reading skills such as
word reading may be mastered more easily, whereas many ELs may
struggle later with understanding texts that have more complex
syntax, vocabulary, or background knowledge needs. Previous
fourth  grade  interventions  have  noted  higher  effects  for  ELs  in 
reading  intervention  on  word  reading  measures  but  not  on  com¬ 
prehension  or  vocabulary  measures  (Wanzek  &  Roberts,  2012). 
Thus, examining differential effects for ELs in a multicomponent,
comprehension-focused reading intervention program
could provide additional evidence regarding for whom a reading
intervention is most valuable.

Passport  to  Literacy 

One  multicomponent  reading  intervention  that  is  widely  used  in 
schools  across  the  United  States  is  Passport  to  Literacy.  Passport  to 
Literacy  is  a  packaged  program  that  applies  principles  of  behav¬ 
ioral  learning  theory  and  cognitive  psychology  (Flavell,  1992; 
Palincsar  &  Brown,  1984),  providing  explicit  instruction  and  strat¬ 
egies  for  reasoning  in  the  foundational  skills  of  reading  (e.g., 
decoding,  word  reading)  as  well  as  reading  comprehension  and 
vocabulary.  Semiscripted  lessons  are  built  sequentially  to  help 
students  acquire  missing  foundational  reading  skills,  increase 
background  knowledge,  and  build  strategies  for  comprehending 
text. 

Although  Passport  to  Literacy  is  widely  used,  there  is  a  lack  of 
independent  research  on  the  program’s  effectiveness.  We  con¬ 
ducted  one  initial  study  of  the  Passport  to  Literacy  intervention 
with  fourth  grade  students.  This  study  was  the  first  causal  study 
conducted  on  Passport  to  Literacy  and  also  the  first  to  examine 
outcomes  on  standardized  measures  of  reading  achievement. 
Fourth  grade  students  scoring  below  the  30th  percentile  in  reading 
comprehension  (n  =  221)  were  randomly  assigned  to  receive  the 
standard  implementation  of  the  Passport  to  Literacy  intervention  or 
typical  school  services.  The  intervention  was  provided  in  small 
groups  of  four  to  seven  students  for  30  min,  4  days  a  week 
throughout  the  school  year  (M  =  90.45  lessons).  There  were  no 
effects  for  Passport  to  Literacy  on  standardized  measures  of  word 
reading  or  fluency,  but  small  effects  were  noted  on  standardized 
measures  of  reading  comprehension  (Effect  Size  (ES)  =  0.14  to 
0.28). Exploratory analyses indicated the intervention effects differed
by students’ comprehension abilities. Students exhibiting
low  levels  of  comprehension  demonstrated  no  increased  benefit  of 
the  Passport  to  Literacy  standard  intervention.  In  other  words,  the 
multicomponent  Passport  to  Literacy  intervention  demonstrated 
average increased outcomes on reading comprehension, but was
least  effective  for  students  with  the  lowest  comprehension  levels. 

In  the  current  study,  we  build  upon  this  previous  study  to 
examine  the  effects  of  Passport  to  Literacy  with  a  larger  sample. 
This  larger  sample  allows  for  a  more  sophisticated  analysis  that 
matches  the  design  of  the  study  taking  into  account  the  differing 
clustering  structures  of  the  treatment  and  comparison  groups.  In 
addition, the larger sample allows us to be more precise in measuring
student reading achievement through the use of latent variables.
By using latent variables, the impact and exploratory analyses
reflect a stronger test of theory as effects are less due to
assessment-specific  outcomes  and  more  to  the  theoretical  overlap 
among  them.  Finally,  the  larger  sample  included  a  large  enough 
sample  of  ELs  to  examine  other  possible  associations  that  may 
explain  the  differential  effects  noted  in  the  first  study. 

Study  Purpose 

The  purpose  of  this  study  was  to  examine  the  effects  of  the 
standard  implementation  of  the  Passport  to  Literacy  intervention 
for  students  with  reading  comprehension  difficulties.  We  sought  to 
examine  the  effects  of  this  multicomponent  intervention  on  stu¬ 
dents’  word  reading,  vocabulary,  and  reading  comprehension.  In 
addition,  we  examined  whether  these  effects  differed  for  students 
with  varying  levels  of  reading  or  English  language  proficiency. 
Specifically,  we  examined  the  following: 

1.  What  are  the  effects  of  Passport  to  Literacy  on  students’ 
word  reading,  vocabulary,  and  reading  comprehension? 

2.  Do  these  effects  differ  by  initial  reading  achievement  or 
English  language  level? 

On  the  basis  of  the  previous  study  of  the  intervention,  we 
hypothesized  that  students  with  reading  difficulties  receiving 
the  Passport  to  Literacy  intervention  would  outperform  students 
receiving  typical  school  services  in  reading  comprehension  and 
not  in  word  reading  or  vocabulary.  We  also  hypothesized  that 
students  with  higher  initial  levels  of  reading  achievement  on 
word  reading,  fluency,  or  comprehension  would  benefit  more 
from  the  intervention.  On  the  basis  of  previous  reading  inter¬ 
vention  work  for  ELs  we  hypothesized  more  benefits  of  the 
multicomponent  intervention  for  ELs  on  word  reading  out¬ 
comes  than  for  their  non-EL  peers. 

Method 

Participants 

Four  hundred  fifty-one  Grade  4  students  who  scored  at  or  below 
the  30th  percentile  on  the  reading  comprehension  subtest  of  the 
Gates-MacGinitie  Reading  Tests  (GMRT;  MacGinitie,  MacGini- 
tie,  Maria,  Dreyer,  &  Hughes,  2006)  were  selected  for  the  study. 
The  students  came  from  16  public  elementary  schools  located 
across  six  school  districts  in  three  states.  One  school  district  was 
located  in  a  large,  urban  metropolitan  area;  one  district  was  located 
in  a  midsize  city;  and  four  districts  were  located  in  rural  areas. 
Male students made up 49% of the sample. With regard to ethnicity,
46% of the students were identified as Hispanic. Of those who
reported language status, 13.2% of the total sample was flagged as
having  a  primary  language  other  than  English  or  as  currently 
receiving  EL  services.  All  schools  provided  only  instruction  in 
English.  The  racial  composition  of  the  sample  was  35%  Black, 
44%  White,  17%  American  Indian,  1%  Asian,  and  2%  multiracial. 
Eighty-five  percent  of  the  students  qualified  for  low  income  or  free 
or  reduced  lunch  programs.  Fifteen  percent  were  identified  as 
having  a  disability.  The  majority  of  students  with  a  disability  were 
identified  with  a  learning  disability  or  a  speech/language  disability. 
There  were  no  differences  in  any  of  the  demographics  between  the 
two  study  groups. 

A  total  of  40  students  (9%  of  total  sample)  withdrew  from  their 
respective  schools  after  the  screening  test.  Attrition  was  12%  (n  = 
27)  in  the  treatment  group  and  6%  (n  =  13)  in  the  comparison 
group. Under the guidelines set forth by the What Works Clearinghouse
(2014), the overall attrition of 9% and differential attrition of 6%
fall into the category of low attrition,
which is operationalized as a condition where the balance between
overall and differential attrition “. . . is expected to result in an
acceptable level of bias even under the conservative assumptions”
(p. 12).
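
The two attrition figures can be reproduced from the counts reported here (27 of 226 treatment students and 13 of 225 comparison students). The following is a minimal sketch, assuming that differential attrition is taken as the absolute difference between the two group rates; the function name is illustrative only and does not come from the What Works Clearinghouse.

```python
def attrition_rates(n_treat, n_comp, lost_treat, lost_comp):
    """Return (overall, differential) attrition as proportions.

    Assumption: differential attrition is the absolute difference
    between the treatment and comparison attrition rates.
    """
    overall = (lost_treat + lost_comp) / (n_treat + n_comp)
    differential = abs(lost_treat / n_treat - lost_comp / n_comp)
    return overall, differential


# Counts taken from the text: 226 treatment, 225 comparison,
# 27 and 13 students lost, respectively.
overall, differential = attrition_rates(226, 225, 27, 13)
print(f"overall = {overall:.1%}, differential = {differential:.1%}")
```

Rounded to whole percentages, these values agree with the 9% overall and 6% differential attrition reported above.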

Procedures 

Screening and assignment. Research staff screened all consented
fourth grade students at the 16 schools during the fourth or
fifth  week  of  school  using  the  reading  comprehension  subtest  of 
the  GMRT.  All  students  scoring  at  or  below  the  30th  percentile  on 
this  measure  were  identified  for  the  study  and  randomly  assigned 
within  school  to  treatment  (Passport;  n  =  226)  or  comparison  ( n  = 
225)  using  stratification  on  the  screening  measure. 
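
The exact randomization routine is not described beyond assignment within school with stratification on the screening measure. One way such a procedure could be sketched, purely as an assumption, is to order students by screening score within each school, form adjacent pairs, and randomly split each pair between conditions. The function and field names below are hypothetical.

```python
import random
from collections import defaultdict


def assign_within_school(students, seed=0):
    """Illustrative stratified assignment (not the authors' procedure).

    students: list of (student_id, school, screening_score) tuples.
    Returns a dict mapping student_id to "treatment" or "comparison".
    """
    rng = random.Random(seed)
    by_school = defaultdict(list)
    for student_id, school, score in students:
        by_school[school].append((score, student_id))

    assignment = {}
    for rows in by_school.values():
        rows.sort()  # order by screening score so adjacent students form a stratum
        for i in range(0, len(rows), 2):
            pair = [student_id for _, student_id in rows[i:i + 2]]
            arms = ["treatment", "comparison"]
            rng.shuffle(arms)  # random order of arms within each stratum
            for student_id, arm in zip(pair, arms):
                assignment[student_id] = arm
    return assignment


# Hypothetical demo data: 20 students spread over 3 schools.
demo = [(f"s{i}", f"school{i % 3}", 10 + (i * 7) % 20) for i in range(20)]
result = assign_within_school(demo)
```

Pairing on the ordered screening score keeps the two conditions balanced within each school to within one student, which is the property that stratified assignment is meant to secure.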

Students  assigned  to  the  treatment  group  were  subsequently 
assigned  within  school  to  small  groups  of  four  to  seven  students  (a 
total  of  43  groups  across  schools).  Each  treatment  group  received 
the  Passport  to  Literacy  intervention  daily  for  30  min  sessions  for 
25  weeks.  Students  assigned  to  the  comparison  group  received  the 
typical  services  provided  by  the  school. 

Data  collection.  Following  screening,  pretest  measures  were 
administered  at  the  end  of  September  and  beginning  of  October  to 
all  participants.  Posttest  assessments  were  administered  in  early 
May,  within  2  weeks  of  the  intervention  completion.  Assessments 
were counterbalanced by measure and were administered by
trained  research  assistants  blind  to  condition  and  assignment.  Prior 
to  pretesting  and  posttesting,  assessment  staff  were  required  to 
demonstrate  100%  accuracy  in  administration  and  scoring  on  all 
measures. Further, all measures were double-scored and double-entered
by two independent research staff members.

We observed students’ school-provided reading instruction.
First,  we  collected  data  on  students’  core,  classroom  reading 
instruction  (Tier  1)  in  the  fall  and  in  the  spring  to  understand  the 
type  and  amount  of  reading  instruction  students  received  in  their 
classrooms.  Observers  were  trained  to  use  the  Instructional  Con¬ 
tent  Emphasis  Instrument-Revised  (ICE-R;  Edmonds  &  Briggs, 
2003)  to  record  what  was  taught,  how  long  it  was  taught,  and  the 
instructional  grouping  used  for  teaching.  Following  the  guidelines 
of  the  ICE-R,  specific  instructional  activities  were  coded  if  they 
lasted  for  at  least  1  min.  Content  categories  included  phonemic 
awareness,  phonics/word  recognition,  fluency,  vocabulary/oral 
language development, comprehension, spelling, text reading separate
from other instruction, and nonliteracy activities (e.g., other
academic  instruction,  noninstructional  time).  Observers  also  coded 
instructional  groupings  as  whole  class,  small-group,  pairs,  inde¬ 
pendent  activity/assignment,  or  individualized  instruction.  Student 
engagement for the overall observation was coded using a three-point
rubric (3 = high engagement, 1 = low engagement). Finally,
observers  assigned  a  global  quality  of  instruction  rating  for  the 
overall  observation  based  on  a  4-point  Likert  scale  ranging  from  1 
(weak)  to  4  (excellent).  This  global  instructional  quality  variable 
considered  a  teacher’s  use  of  direct  and  explicit  language,  model¬ 
ing,  students’  opportunities  for  practice,  specific  feedback,  moni¬ 
toring  and  encouragement  of  engagement,  scaffolding  of  tasks,  and 
pacing  throughout  the  lesson. 

We  used  a  multiple-step  training  process  to  establish  interrater 
reliability for the Tier 1, classroom reading instruction observations
in fall and again in the spring before each round of observations
began. Initially, each observer was instructed on the meaning
of each code/indicator and provided specific examples. Next,
the  coding  process  was  modeled  by  the  principal  investigator  of 
the  project  using  a  short  video  segment  of  reading  instruction  from 
another  project.  Finally,  each  observer  practiced  coding  using 
several  novel  video  segments  that  were  subsequently  discussed 
with  the  principal  investigator.  Each  observer  established  90%  or 
higher  coding  accuracy  with  the  principal  investigator  (i.e.,  gold 
standard  approach)  on  a  separate  video  segment  of  reading  instruc¬ 
tion.  Observers  reestablished  reliability  prior  to  spring  observa¬ 
tions  with  new  video  segments.  All  coders  were  required  to  be 
above  90%  reliability  at  each  time  point.  Exact  interrater  reliability 
across  coders  and  time  periods  was  95.1%. 
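
The agreement index is not defined in detail here. A common reading of agreement with a gold standard is simple exact percent agreement, and the sketch below assumes that definition. The codes are invented for illustration and are not study data.

```python
def percent_agreement(observer_codes, gold_codes):
    """Exact percent agreement between an observer and a gold-standard coder."""
    if len(observer_codes) != len(gold_codes):
        raise ValueError("Both coders must rate the same number of segments.")
    matches = sum(o == g for o, g in zip(observer_codes, gold_codes))
    return 100.0 * matches / len(gold_codes)


# Hypothetical codes for ten 1-min segments of a training video.
gold = ["comprehension", "comprehension", "vocabulary", "fluency", "phonics",
        "comprehension", "vocabulary", "spelling", "text reading", "comprehension"]
observer = list(gold)
observer[3] = "text reading"  # a single disagreement with the gold standard

agreement = percent_agreement(observer, gold)
meets_criterion = agreement >= 90.0  # the 90% criterion described in the text
```

Under this definition, an observer who disagrees with the gold standard on one of ten segments sits exactly at the 90% criterion.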

To  identify  any  supplemental  reading  instruction/intervention, 
research  staff  completed  brief  interviews  with  classroom  teachers 
regarding  additional  reading  support  beyond  core  reading  instruc¬ 
tion  for  each  participating  student.  Each  semester  teachers  indi¬ 
cated  the  session  time,  frequency,  grouping,  implementer,  and 
implementer’s  credentials.  All  supplemental  intervention  sessions 
in  both  study  conditions  were  audio  recorded  at  three  time  points 
during  the  school  year  (fall,  winter,  and  spring);  recordings  of 
instruction  were  then  coded  using  the  ICE-R  measure  to  describe 
any  interventions  students  received. 

In  addition,  the  fidelity  of  implementation  of  the  Passport  to 
Literacy  intervention  was  monitored  monthly  via  direct  observa¬ 
tions  of  lessons  with  a  measure  specific  to  the  required  compo¬ 
nents  of  the  Passport  to  Literacy  intervention.  Interventionists  were 
observed  and  scored  on  implementation  of  each  activity,  student 
academic  engagement,  and  quality  of  instruction  for  each  lesson 
component.  The  scale  for  implementation  ranged  from  0  (teacher 
did  not  complete  elements  of  component)  to  3  (all  or  nearly  all 
required  elements  completed),  while  engagement  and  instructional 
quality  were  also  rated  from  1  (weak  engagement  or  quality)  to  3 
(excellent  engagement  or  quality).  Instructional  quality  indicators 
included  ongoing  monitoring,  redirection  of  off-task  behavior, 
positive  and  corrective  feedback,  organization  of  materials,  and 
appropriate  selection  of  additional  items  for  practice  when  needed. 
Each  observer  obtained  a  minimum  reliability  of  90%  in  compar¬ 
ison  to  a  gold  standard  rating  by  the  project  coordinator  prior  to 
formal  data  collection;  across  three  observers,  reliability  was 
95.3%. 


Description  of  Instruction 

Tier  1,  classroom  reading  instruction.  Data  from  observa¬ 
tions  of  core  reading  instruction  received  by  all  participating 
students  indicated  that  the  length  of  reading  classes  was,  on  aver¬ 
age,  75.40  min  (SD  =  26.34).  Within  this  instruction,  activities 
devoted  to  reading  comprehension  and  vocabulary  development 
were  most  prevalent,  accounting  for  nearly  35  min  (46%)  of  total 
time.  Instruction  devoted  to  word  analysis/decoding  was  minimal 
(<1  min  [<  1%  of  time]),  while  time  spent  in  reading  of  connected 
text  and/or  reading  fluency  practice  was  approximately  9  min 
(12%  of  time)  daily.  Of  note,  approximately  15  min  (20%  of  time) 
was  spent  in  differentiated  instructional  activities  where  students 
in  the  class  were  engaged  in  different  activities  simultaneously. 
The  additional  14  min  (19%)  of  time  was  spent  in  other  types  of 
activities  (e.g.,  transitions).  Core  reading  instruction  primarily 
occurred  as  whole-class  instruction  (approximately  45  min  or  60% 
of  time  on  average).  Just  less  than  10  min  (13%)  of  instructional 
time  consisted  of  students  working  independently  on  the  same 
activity,  while  approximately  8  min  (11%)  was  spent  in  either 
small-group  or  paired  instructional  activities.  Generally,  the  global 
ratings  of  instruction  for  the  core  classroom  instruction  were 
suggestive of high-average instructional quality (M = 3.17, SD = .59).
Similarly, academic engagement by students during core
reading  instruction  was  rated  as  high  (M  =  2.78,  SD  =  .55). 

School-provided  supplemental  instruction.  A  total  of  130 
students (n = 62 treatment [27%]; n = 68 comparison [30%]) also
received  supplemental  intervention  provided  by  their  respective 
schools  for  all  or  part  of  the  year.  Teacher  reports  indicated  that 
this  supplemental  reading  intervention  was  most  often  delivered  by 
classroom  teachers  (20%)  or  other  certified  teachers  (43%  of 
students), with eight interventions (18%) delivered by a paraprofessional
or a volunteer and six interventions (14%) delivered by
speech-language  pathologists.  Interventions  most  often  held  ses¬ 
sions  between  31  and  50  min  (70%)  with  16%  of  the  interventions 
meeting  between  21  and  30  min  and  10%  between  10  and  20  min. 
Seventy  percent  of  the  interventions  were  held  in  group  sizes  of 
one  to  five  students.  Nine  students  received  two  supplemental 
interventions  during  the  school  day. 

Across  the  2  years,  based  on  recordings  of  this  instruction, 
intervention  sessions  averaged  28.34  min  (SD  =  13.78).  The  most 
frequent  instructional  activities  involved  those  related  to  compre¬ 
hension  of  text  (M  =  8.27  min,  SD  =  7.60)  with  about  29%  of 
intervention  time,  as  well  as  vocabulary  and  oral  language  devel¬ 
opment  (M  =  4.45  min,  SD  =  5.90)  for  about  16%  of  intervention 
time. Text reading without other instruction occurred for approximately
6 min (M = 6.43 min, SD = 5.1) or 23% of intervention
time,  and  students  received  phonics/decoding  instruction  for  an 
average  of  3.84  min  (SD  =  7.86)  or  14%  of  intervention  time. 
Minimal  instruction  (0-4%  of  intervention  time)  was  focused  on 
oral reading fluency practice (M = .53 min, SD = 1.71), spelling
(M = 1.22 min, SD = 3.27), or phonemic awareness (M = .04
min,  SD  =  .23).  During  the  additional  reading  intervention,  an 
average  of  1.86  min  (SD  =  3.74)  or  7%  of  instructional  time  was 
spent  in  other  academic  instruction.  About  4%  of  the  intervention 
time was spent in noninstructional activities (M = 1.04 min, SD =
3.68).  The  mean  rating  of  instructional  quality  for  students  who 
received  supplemental  reading  instruction  was  2.83  (SD  =  .47) 
and  student  engagement  was  also  high  (M  =  2.65,  SD  =  .36). 
Table 1 provides information on this typical school instruction in
comparison  to  the  treatment  intervention  sessions. 

Passport  to  Literacy  intervention.  We  provided  the  standard 
implementation  of  the  Passport  to  Literacy  intervention  program  at 
the  fourth-grade  level  to  students  in  the  treatment  condition.  Pass¬ 
port  to  Literacy  is  designed  to  be  used  as  a  supplemental  reading 
intervention  provided  in  small  groups  daily  for  30  min  sessions  for 
1  school  year  (up  to  120  lessons).  We  scheduled  the  intervention 
sessions  with  the  school/teachers  outside  of  their  core,  classroom 
reading  instruction  block,  typically  during  the  time  that  schools 
had  already  designated  for  intervention/enrichment. 

The  Passport  to  Literacy  intervention  is  broken  into  12,  10-day 
adventures,  with  each  lesson  targeting  phonics  and  word  recogni¬ 
tion,  fluency,  vocabulary,  and  comprehension.  To  monitor  stu¬ 
dents’  mastery  of  content  and  progress  on  oral  reading  fluency, 
checkpoints  are  designed  at  the  fifth  and  10th  lesson  of  each 
adventure.  The  sequence  of  instruction  began  with  an  Adventure 
Starter  activity  (approximately  3-5  min)  to  build  background 
knowledge  by  linking  the  lessons  and  readings  to  the  adventure. 
Then,  lessons  included  two  major  components;  the  first,  Word 
Works,  or  word  study,  taught  students  to  read  and  understand 
unknown  multisyllabic  words  using  strategies  to  break  words 
down  into  smaller  parts,  including  affixes,  roots,  and  syllabication. 
For  the  first  6  weeks,  the  Word  Works  instruction  was  20  min  and 
also  included  more  basic  word  reading  skills  such  as  letter/sound 
identification,  decoding,  sight  word  reading,  word  families,  and 
spelling  instruction.  In  subsequent  lessons,  Word  Works  was  re¬ 
duced  to  5  min,  but  also  included  a  brief  2  min  Warm-Up  where 
students  received  additional  word  study  practice  through  review 
and  application  of  previously  learned  letter  combinations,  sight 
words,  spelling  rules,  and  word  endings. 

Then,  during  the  second  component,  Read  to  Understand,  stu¬ 
dents  were  taught  the  meaning  of  vocabulary  words  introduced 
during  Word  Works,  as  well  as  comprehension  skills  and  strategies 
to  apply  while  reading  fiction  and  nonfiction.  For  example,  lessons 
offered  explicit  instruction  in  previewing,  setting  purpose,  text 
structure  and  evaluation,  making  inferences  and  taking  perspec¬ 
tives,  drawing  conclusions,  author’s  purpose,  sequencing,  main 
idea,  summarizing,  independent  reading  fix-up  strategies,  teacher 
and  reader  questioning,  and  making  connections  within  and  across 
texts.  In  the  first  6  weeks,  instruction  in  the  Read  to  Understand 
component lasted 10 min and, in subsequent lessons, was increased
to 25 min. Lessons also included a brief focus on fluency (reading
with appropriate accuracy, rate, and expression) during the text
reading.

Table 1

Average Intervention Instructional Time in Minutes and
Percentage of Time by Study Condition

                                  Passport intervention          School-provided intervention
Instructional component           No. of min   % of total time   No. of min   % of total time
Phonics and word recognition          3.29           12              3.84           14
Spelling                              1.32            5              1.22            4
Reading fluency                        .26            1               .53            2
Vocabulary/oral language              6.05           21              4.45           16
Comprehension                        11.80           41              8.27           29
Non-instructional text reading        4.72           17              6.43           23
Other academic instruction             .27            1              1.86            7
Noninstruction                         .18            1              1.04            4

Intervention  teachers  and  training.  A  total  of  17  teachers, 
hired  by  the  research  team,  were  responsible  for  teaching  the 
Passport  to  Literacy  lessons.  All  the  teachers  had  a  bachelor’s 
degree,  four  (33.3%)  had  obtained  a  master’s  degree  in  education, 
and  one  had  a  PhD.  Twelve  of  the  interventionists  were  certified 
teachers  and  one  was  a  counselor.  The  other  four  had  degrees  in 
noneducation areas. All intervention teachers were female. Three
teachers identified as Hispanic. In terms of
race, 11 teachers (64.7%) were White, five teachers (29.4%)
were Black, and one chose not to report this information.

Prior  to  the  start  of  instruction,  intervention  teachers  participated 
in  approximately  8  hr  of  training  over  the  course  of  2  days. 
Training, provided by the project coordinators at each site, allowed
interventionists  to  become  oriented  to  the  project,  familiarize 
themselves  with  the  Passport  to  Literacy  intervention  program  and 
instructional  routine,  practice  implementation  of  lessons,  and  dis¬ 
cuss  positive  behavior  supports.  Once  intervention  sessions  with 
students  were  initiated,  twice  monthly  coaching  visits  were  con¬ 
ducted  by  the  project  coordinators.  These  visits  allowed  teachers  to 
receive  feedback  on  implementation  as  well  as  discuss  any  ques¬ 
tions  or  concerns.  Finally,  monthly  meetings  with  all  intervention 
teachers  were  held  at  each  site  to  provide  continued  support  and 
ensure  fidelity  of  implementation. 

Intervention  implementation  and  fidelity.  The  total  number 
of  Passport  to  Literacy  lessons  covered  for  each  of  the  intervention 
groups  ranged  from  83  to  106  sessions.  For  those  individual 
students  who  remained  in  the  school  for  the  duration  of  the 
intervention,  the  number  of  lessons  attended  ranged  from  a  low  of 
58  sessions  to  a  high  of  106  sessions  (M  =  93.79,  SD  =  7.82). 

As  noted  earlier,  each  intervention  teacher  recorded  three  inter¬ 
vention  lessons  during  the  year,  and  these  recordings  were  coded 
for  instructional  content  and  quality  using  the  ICE-R  to  directly 
compare  the  instructional  elements  in  Passport  and  the  school- 
provided  interventions.  On  average,  the  treatment  session  instruc¬ 
tion  was  28.56  min  (SD  =  4.07)  in  length.  Instruction  focused  on 
developing  students’  reading  comprehension  (M  =  11.80  [41%  of 
intervention  time],  SD  =  5.65)  and  vocabulary/oral  language  abil¬ 
ity  (M  =  6.05  [21%  of  intervention  time],  SD  ~  4.81).  During 
treatment  lessons,  students  engaged  in  text  reading  for  4.72  min 
(SD  =  2.43)  or  17%  of  intervention  time,  decoding  and  word 
reading activities for 3.29 min (SD = 3.11) or 12% of intervention
time  and  practiced  spelling  for  just  over  1  min  (M  =  1.32,  SD  = 
2.34)  or  5%  of  intervention  time.  Explicit  instruction  in  oral 
reading  fluency  was  observed  for  0.26  min  (SD  =  0.92)  or  1%  of 
intervention  time,  on  average.  During  treatment  lessons,  less  than 
1  min  (1%)  of  time  was  considered  either  noninstructional  in 
nature  (M  =  0.18,  SD  =  0.64)  or  focused  on  instruction  in  another 
academic  area  such  as  writing  or  grammar  (M  =  .27,  SD  =  0.83). 
Ratings  of  instructional  quality  indicated  high-average  quality 
(M  =  3.37,  SD  =  .62)  and  on  average,  intervention  students  were 
engaged  during  instruction  (M  =  2.85,  SD  =  .43). 
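
The percentages reported for the treatment sessions can be reproduced from the reported minutes. The sketch below assumes that the denominator is the mean session length of 28.56 min rather than the sum of the coded categories; the text does not state this explicitly, and the variable names are illustrative.

```python
# Mean minutes per instructional component in Passport sessions,
# as reported in the text and in Table 1.
passport_minutes = {
    "Phonics and word recognition": 3.29,
    "Spelling": 1.32,
    "Reading fluency": 0.26,
    "Vocabulary/oral language": 6.05,
    "Comprehension": 11.80,
    "Text reading": 4.72,
    "Other academic instruction": 0.27,
    "Noninstruction": 0.18,
}

MEAN_SESSION_MIN = 28.56  # assumed denominator: reported mean session length


def percent_of_session(minutes, session_length):
    """Express each component's minutes as a rounded percentage of the session."""
    return {name: round(100 * value / session_length)
            for name, value in minutes.items()}


passport_percent = percent_of_session(passport_minutes, MEAN_SESSION_MIN)
for name, pct in passport_percent.items():
    print(f"{name}: {pct}%")
```

With this denominator the rounded values agree with the percentages given above, for example 41% for comprehension and 21% for vocabulary/oral language.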

In  terms  of  direct  fidelity  of  implementation  to  the  Passport  to 
Literacy lessons, mean implementation ratings for each tutor
were high, ranging from 2.71 to 3.00, across the
lesson  components.  Similarly,  mean  ratings  of  student  academic 
engagement (2.85 to 3.00) and quality of tutor instruction (2.76 to
3.00)  for  each  component  were  high. 

Dependent  Measures 

Project  staff  blind  to  condition  assessed  students’  word  reading, 
decoding,  vocabulary,  reading  fluency,  and  reading  comprehension  in 
the  fall  and  spring.  Because  of  the  high  correlation  between  students’ 
word  reading  and  oral  reading  fluency  (see  Table  2),  we  included  only 
the  word  reading  measures  in  the  dependent  variables,  but  examined 
possible moderation of outcomes by students’ fluency.

Woodcock-Johnson  III  Tests  of  Achievement  (WJIII; 
Woodcock,  McGrew,  &  Mather,  2001).  To  assess  word  read¬ 
ing  and  comprehension,  we  selected  four  individually  administered 
subtests  from  the  nationally  standardized  WJIII.  The  letter-word 
identification  subtest  measures  recognition  of  real  words,  and 
begins  with  individual  letters.  The  word  attack  subtest  measures 
decoding  skill  and  includes  items  that  are  pseudowords,  which 
begin  with  a  few  single  letter  sounds  and  progress  to  decoding  of 
complex  pseudowords.  The  picture  vocabulary  test  asks  students  to 
name  pictured  objects  increasing  in  difficulty.  The  passage  com¬ 
prehension  subtest  measures  how  well  students  can  read  text  with 
missing  words,  presented  as  a  cloze  procedure  in  which  students 
read  the  sentences  silently  and  are  asked  to  supply  the  missing 
word. Test authors report that test-retest reliabilities for these four
subtests at fourth grade are .81, .85, .77, and .86, respectively.

Dynamic Indicators of Basic Early Literacy Skills-6th Edition
(DIBELS; Good & Kaminski, 2002). To assess students’
ability to read connected text with speed and accuracy, we administered
the oral reading fluency (ORF) subtest from DIBELS.
Students  read  three  separate  passages  aloud  for  1  min  and  the  total 
number  of  correct  words  read  per  minute  from  the  passage  is 
considered  the  oral  reading  fluency  rate.  Test-retest  reliabilities  for 
ORF with elementary age students range from .92 to .97; alternate-form
reliability across passages from the same level is reported as
.89  to  .94  (Good  et  al.,  2004). 


GMRT  (MacGinitie  et  al.,  2006).  The  GMRT  is  a  group- 
administered,  norm-referenced  test.  We  administered  the  vocabu¬ 
lary  and  comprehension  subtests.  The  fall  reading  comprehension 
scores  were  used  to  screen  students  for  inclusion  in  the  study. 
Vocabulary  presents  words  in  context.  The  student  chooses  the 
correct  meaning  of  the  target  word.  Comprehension  provides  stu¬ 
dents  with  reading  passages  and  multiple  choice  questions.  Ques¬ 
tions  address  facts,  inferencing,  and  drawing  conclusions.  Test- 
retest  reliabilities  are  above  .85.  Construct  validity  estimates  range 
from  .79  to  .81. 

Analytic  Approach 

For  both  research  questions,  a  longitudinal,  multilevel  structural 
equation  modeling  (ML-SEM)  framework  was  used  to  estimate 
primary  and  conditional  impacts.  A  structural  equation  model 
approach  is  useful  as  it  minimizes  the  limitation  of  measurement 
error  inherent  to  individual  observed  measures  by  leveraging  the 
common  variance  across  multiple  assessments  of  a  construct. 
Common  specifications  of  the  ML-SEM  for  randomized  controlled 
trials  include  latent  factors  of  pretest  and  posttest  measures  at  both 
a  lower  level  unit,  such  as  students,  and  at  an  upper  level  unit  (e.g., 
classrooms).  Similar  to  multilevel  models  of  observed  outcomes, 
the  ML-SEM  includes  the  regression  of  posttest  on  pretest  but  in 
this  case  with  latent  variables.  Estimation  of  the  treatment  effect 
may occur through one of two common approaches. One methodology
includes the simple regression of the posttest on k - 1 dummy
codes  for  a  grouping  variable,  where  k  is  the  number  of  treatment 
arms,  to  reflect  whether  an  individual  received  the  intervention  or 
not.  An  alternative  approach  does  not  include  a  variable  for  treat¬ 
ment  status,  but  rather  tests  for  group  differences  through  a  mul¬ 
tiple  group  invariance  approach.  In  this  instance  the  test  of  impact 
is  estimated  by  inspecting  the  posttest  means  for  invariance  be¬ 
tween  groups  when  constraining  other  parameters  of  the  model  to 
be  equal  (e.g.,  loadings,  residual  variances,  regression  of  posttest 
on pretest). The difference in standardized posttest means between
groups then represents the standardized effect size difference.

Table 2

Descriptive Statistics and Correlations for Study Measures

Variable              1       2       3       4       5       6       7       8       9       10      11      12      13
1. Fall GMRT RC       -
2. Fall WJ PC         .32     -
3. Fall WJ LWID       .30     .60     -
4. Fall WJ WA         .26     .52     .77     -
5. Fall GMRT Voc      .39     .49     .52     .41     -
6. Fall WJ PV         .14     .51     .25     .12     .33     -
7. DIBELS ORF         .29     .51     .70     .62     .46     .13     -
8. Spring GMRT RC     .32     .46     .38     .32     .43     .23     .44     -
9. Spring WJ PC       .35     .64     .54     .43     .50     .43     .47     .47     -
10. Spring WJ LWID    .29     .60     .82     .72     .49     .21     .69     .39     .61     -
11. Spring WJ WA      .24     .49     .76     .76     .44     .19     .60     .30     .50     .79     -
12. Spring GMRT Voc   .31     .55     .51     .41     .64     .34     .49     .64     .53     .54     .46     -
13. Spring WJ PV      .17     .52     .33     .16     .39     .74     .23     .26     .54     .36     .26     .43     -
M                     440.61  481.92  484.78  490.32  445.93  486.44  80.35   456.69  487.54  493.01  495.90  462.06  491.11
SD                    19.37   12.16   18.97   16.55   27.51   12.41   26.87   24.13   9.66    17.85   14.40   30.67   11.91
N                     412     409     409     409     328     409     410     405     404     404     404     406     404
% missing data        .0%     .7%     .7%     .7%     20.4%   .7%     .5%     1.9%    1.9%    1.9%    1.9%    1.5%    1.9%

Note. GMRT RC = Gates-MacGinitie Reading Comprehension; WJ PC = WJ-III
Passage Comprehension; WJ LWID = WJ-III Letter Word Identification;
WJ WA = WJ-III Word Attack; GMRT Voc = Gates-MacGinitie Vocabulary;
WJ PV = WJ-III Picture Vocabulary. All correlations are statistically
significant at p < .05 or lower.

ML-SEMs have received a fair amount of attention in the recent
literature (e.g., Goddard, Goddard, Kim, & Miller, 2015; Heck & Thomas,
2015) as a method not only for overcoming measurement issues but also
for increasing power to detect effects, because latent variables
increase the reliability of the measured construct. A known property
of effect sizes is that they are attenuated by unreliability of
measurement. Consequently, with the greater measurement precision
afforded by the latent variable, it is possible to detect larger
effects that may not be detectable when observed variables contain
measurement error.
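
The attenuation property described here can be illustrated numerically. The sketch below assumes the classical test theory relation in which an observed standardized mean difference equals the true difference multiplied by the square root of the outcome's reliability. The function name and the example reliabilities are illustrative assumptions and are not values from this study.

```python
import math


def attenuated_effect_size(true_d: float, reliability: float) -> float:
    """Observed standardized mean difference under classical test theory.

    Assumes the standardizer is the observed-score SD, so that
    d_observed = d_true * sqrt(reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * math.sqrt(reliability)


if __name__ == "__main__":
    # Hypothetical true effect and reliabilities, for illustration only.
    for rel in (1.0, 0.9, 0.8, 0.7):
        d_obs = attenuated_effect_size(0.40, rel)
        print(f"reliability={rel:.1f}  observed d={d_obs:.3f}")
```

Under this assumption, the less reliable the outcome, the smaller the observed effect for the same underlying difference, which is the motivation for modeling the outcome as a latent variable.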

Despite the increasing prevalence of ML-SEM in the literature for
testing treatment effects, its application has been limited in
randomized designs where not all units are nested. In partially
nested randomized controlled trials (PN-RCT; Baldwin, Bauer,
Stice, & Rohde, 2011; Lohr, Schochet, & Sanders, 2014), only
some individuals are nested within a group. In the present study,
partial nesting arises because students receiving the intervention
were all nested within small groups, whereas the comparison
students were not. Baldwin et al. (2011) noted that, in the studies
with PN-RCT designs they reviewed, researchers frequently ignored
this structure to the detriment of standard error estimation.
Although robust methods have been proposed that model observed
measures for PN-RCT designs, less attention has been given to the
treatment of PN-RCT data in the ML-SEM context. Sterba et al.
(2014) presented an approach within Mplus that allows an analyst
to match the ML-SEM methodology to the PN-RCT design.
However, a limitation of the reported observed and latent variable
approaches for PN-RCT data is that they involve the introduction of
ancillary variables into the data, as well as additional model
specifications (e.g., adjusting estimation of the denominator
degrees of freedom for observed variables) that are not possible to
implement across commonly used software.

A more naturalistic approach to treating PN-RCT data is to view
the nesting structure through n-level SEM (nSEM; Mehta & Neale,
2005), which easily accommodates complex nesting. Within
nSEM, observed and latent variables may be used across multiple
levels. The concept of level in nSEM takes on a meaning that
differs from its use in multilevel modeling, where a level typically
refers to a unit of clustering for one set of observations within
another unit, such as students nested within classrooms. A level in
nSEM refers to this type of nesting but further describes any
meaningful, nominal grouping of individuals, such as male or female,
students eligible for free/reduced lunch or not, or those who
received an intervention or not. This more flexible use of level
allows us to situate the PN-RCT design more naturally in the nSEM
framework. Consider the sample nSEM model in Figure 1 that is
relevant to the current study. Note that there are four boxes, each
representing a grouping of participants. For students, there are
two levels of groupings: one for the Passport students (Level 1)
and one for comparison students (Level 2). Small group represents
a nesting structure for only the Passport students (Level 3), and
Classrooms represent the nesting of students from both student
groups in classrooms (Level 4). Figure 1 then represents a
four-level partially nested, cross-classified SEM where the
comparison students are nested within classrooms and the Passport
students are cross-classified by small groups and classrooms.

Figure 1. Sample n-level structural equation measurement model for
partially nested designs.
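
The four-level structure described for Figure 1 can be pictured as a simple data layout. The sketch below is only an illustration of the partial nesting and cross-classification, not the estimation model; the class name, field names, and identifiers are hypothetical and do not come from the study's data files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Student:
    student_id: int
    condition: str                 # "passport" or "comparison"
    classroom_id: int              # every student belongs to a classroom
    small_group_id: Optional[int]  # only Passport students have a small group


def levels_for(student: Student) -> List[str]:
    """Return the nSEM levels (as described for Figure 1) a student belongs to."""
    if student.condition == "passport":
        if student.small_group_id is None:
            raise ValueError("Passport students must belong to a small group")
        return ["Level 1: Passport student",
                "Level 3: small group",
                "Level 4: classroom"]
    if student.condition == "comparison":
        if student.small_group_id is not None:
            raise ValueError("comparison students are not nested in small groups")
        return ["Level 2: comparison student",
                "Level 4: classroom"]
    raise ValueError(f"unknown condition: {student.condition!r}")


# Hypothetical roster: students 1 and 2 share a small group but sit in
# different classrooms, which is what makes the design cross-classified.
roster = [
    Student(1, "passport", classroom_id=10, small_group_id=100),
    Student(2, "passport", classroom_id=11, small_group_id=100),
    Student(3, "comparison", classroom_id=10, small_group_id=None),
]

if __name__ == "__main__":
    for s in roster:
        print(s.student_id, levels_for(s))
```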

At this point, it may be useful to introduce more specific
components of the model. For both the Passport and comparison
levels, the SEM specifies that there is a posttest ($\eta^1_1$ for
Passport and $\eta^2_1$ for comparison), where the superscript
denotes the level for the parameter and the subscript denotes the
parameter number. Thus, $\eta^1_1$ is the first Level-1 latent
variable (i.e., the Passport posttest latent variable) and
$\eta^2_1$ is the first Level-2 latent variable (i.e., the
comparison group posttest latent variable). $\eta^1_2$ then is
the second latent variable for the Passport group (i.e., the pretest)
and $\eta^2_2$ is the pretest latent variable for the comparison
group. The latent variables in Passport are indicated by the four
measures $Y^1_1$ to $Y^1_4$, two at pretest and the same two at
posttest, as are the latent variables for the comparison group,
which are indicated by the same measures $Y^2_1$ to $Y^2_4$. Each
of the observed measures has a residual ($\theta$) and loading
($\lambda$). Note that the loading subscripts are the same from
posttest to pretest and between the Passport and comparison groups.
This specification denotes that the model constrains the estimated
values to be equal across groups, as it does also for the residual
variances and the regression of the posttest latent construct on the
pretest ($\beta$). Across all four levels, there are latent means
($\alpha$) and variances ($\psi$). As a multilevel model, only the
latent means at the student levels (i.e., Passport and comparison)
are estimated; they are fixed at 0 at the small group and classroom
levels. Similar to a longitudinal SEM, the pretest means (not
reflected in the diagram) are set at 0 and the pretest variances are
fixed at 1. This specification standardizes the means at the
posttest such that the difference between $\alpha^1_1$ and
$\alpha^2_1$ is the standardized treatment effect.
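
The specification described in this paragraph can be summarized in equation form. The following is a simplified sketch under stated assumptions rather than the exact model that was estimated: it shows only the two student levels ($g = 1$ for Passport, $g = 2$ for comparison), omits the small group and classroom components, and introduces an index $m$ for the measure, an intercept $\nu_m$, and a disturbance $\zeta^g$ that are not named in the text.

```latex
\begin{aligned}
Y^{g}_{m,\mathrm{post}} &= \nu_m + \lambda_m\,\eta^{g}_{1} + \varepsilon^{g}_{m,\mathrm{post}},
  \qquad
Y^{g}_{m,\mathrm{pre}} = \nu_m + \lambda_m\,\eta^{g}_{2} + \varepsilon^{g}_{m,\mathrm{pre}},
  \qquad \operatorname{Var}(\varepsilon) = \theta_m, \\
\eta^{g}_{1} &= \alpha^{g}_{1} + \beta\,\eta^{g}_{2} + \zeta^{g}, \\
\operatorname{E}\!\left(\eta^{g}_{2}\right) &= 0, \qquad \operatorname{Var}\!\left(\eta^{g}_{2}\right) = 1, \\
\Delta\alpha &= \alpha^{1}_{1} - \alpha^{2}_{1}.
\end{aligned}
```

Because $\lambda_m$, $\theta_m$, and $\beta$ carry no group superscript, they are held equal across the Passport and comparison levels, so any remaining group difference in the posttest is carried by $\Delta\alpha$.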

The model building process for the PN-RCT nSEM occurred in
two phases comprising eight models in total. Phase 1 focused on
testing longitudinal invariance of the loadings and intercepts, and
Phase 2 tested between-level posttest invariance. Within Phase 1,
three models were tested: (1) freed loadings and intercepts across
pretest and posttest latent variables in treatment and comparison
groups (Model 1); (2) invariant loadings and freed intercepts across
pretest and posttest latent variables in treatment and comparison
groups (Model 2); (3) invariant loadings and intercepts across
pretest and posttest latent variables in treatment and comparison
groups (Model 3). These steps were necessary to evaluate whether
a fully invariant model for intercepts and loadings was plausible,
such that the latent means reflect actual latent mean differences
and not differences in loading/intercept structure. For Phase 2,
five models were tested to evaluate posttest invariance across
combinations of the treatment, comparison, and small group levels:
(1) freed loadings and intercepts across treatment, comparison, and
small group levels (Model 4); (2) invariant loadings and freed
intercepts between treatment and comparison levels (Model 5);
(3) invariant loadings and intercepts between treatment and
comparison levels (Model 6); (4) invariant loadings and freed
variances between treatment and small group levels (Model 7); and
(5) invariant loadings, intercepts, pretest means, and variances
across treatment, comparison, and small group levels (Model 8).
The set of eight models was applied to each of the reading
comprehension, word reading, and vocabulary outcomes. Exploratory
analyses tested whether EL status, pretest, letter-word
identification, or oral reading fluency moderated the relation
between treatment status and posttest performance. Model comparisons
were made using the deviance statistic as well as the Akaike
information criterion and Bayesian information criterion indices. A
log-likelihood difference test was used for hypothesis testing of
model differences.
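
The log-likelihood difference test used for these comparisons refers the difference in deviance (−2LL) between two nested models to a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters. The sketch below is a minimal, standard-library implementation of that calculation; the function names are illustrative, and the closed-form tail formulas for integer degrees of freedom are a convenience chosen here, not a description of the software used in the study.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variate, integer df."""
    if df < 1 or x < 0:
        raise ValueError("df must be >= 1 and x must be >= 0")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite correction series
    # sum_{r=1}^{(df-1)/2} x^(r - 1/2) / (1 * 3 * ... * (2r - 1)).
    root = math.sqrt(x)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(half)) + 2.0 * density * total


def deviance_difference_p(delta_deviance: float, delta_df: int) -> float:
    """p value for a nested-model deviance (-2LL) difference."""
    return chi2_sf(delta_deviance, delta_df)


if __name__ == "__main__":
    # Deviance differences and df as tabled for reading comprehension
    # (Model 3 vs. Model 2, and Model 8 vs. Model 7).
    print(round(deviance_difference_p(0.65, 2), 3))
    print(round(deviance_difference_p(4.18, 6), 3))
```

A large p value indicates that the added equality constraints do not significantly worsen fit, which is the pattern sought when testing invariance.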

Results 

Descriptive  Statistics  and  Correlations 

A preliminary review of the data for missingness (see Table 2)
showed that complete data were available for the fall GMRT-RC
measure (n = 412), but missing data rates varied from .5% to
20.4% for the other measures. The high level of missing data on
the fall GMRT vocabulary measure occurred because the measure was
not administered at one site in Year 1. Little's missing completely
at random (MCAR) test suggested that all missing data met reasonable
assumptions for MCAR, χ²(81) = 77.99, p > .500; thus, using full
information maximum likelihood for model estimation was appropriate
and would not negatively bias results.

Table 2 presents the full sample student performance results on
the individual measures of reading comprehension, word reading,
and vocabulary at fall and spring, and Table 3 reports means and
standard deviations by treatment condition. Students' scores on the
measures were consistently higher in the spring than in the fall.
Correlations among the measures in the fall ranged from .12
between WJIII picture vocabulary and word attack to .77 between
WJIII word attack and letter-word identification. Spring
correlations ranged from .26 between WJIII picture vocabulary and
GMRT reading comprehension to .79 between WJIII word attack
and letter-word identification. Stability coefficients from fall to
spring ranged from .32 for GMRT reading comprehension to .82
for WJIII letter-word identification, suggesting moderate to high
stability in relative rank orders of individuals over time.

Table 3

Descriptive Statistics of Measures by Condition

                    Passport                   Comparison
Measure             N     M        SD          N     M        SD
Fall GMRT RC        199   439.96   19.96       213   441.23   18.82
Fall WJ PC          199   481.52   11.67       210   482.30   12.61
Fall WJ LWID        199   484.43   18.82       210   485.12   19.14
Fall WJ WA          199   488.91   16.99       210   491.65   16.03
Fall GMRT Voc       159   444.87   27.09       169   446.93   27.93
Fall WJ PV          199   486.85   12.98       210   486.05   11.84
Fall DIBELS ORF     198   78.11    25.58       212   82.44    27.91
Spring GMRT RC      198   459.25   23.93       207   454.23   24.11
Spring WJ PC        198   488.12   9.^5        206   486.98   9.93
Spring WJ LWID      198   492.79   17.14       206   493.23   18.54
Spring WJ WA        198   495.47   14.67       206   496.31   14.21
Spring GMRT Voc     198   462.08   31.87       208   462.04   29.55
Spring WJ PV        198   491.70   11.97       206   490.54   11.85

Note. GMRT RC = Gates-MacGinitie Reading Comprehension; WJ
PC = WJ-III Passage Comprehension; WJ LWID = WJ-III Letter Word
Identification; WJ WA = WJ-III Word Attack; GMRT Voc =
Gates-MacGinitie Vocabulary; WJ PV = WJ-III Picture Vocabulary;
ORF = Oral Reading Fluency.

Table 4

Confirmatory Factor Analysis Model Fit Comparison for Latent Reading
Comprehension, Word Reading, and Vocabulary

Outcome                 Model   −2LL      df   AIC    BIC    Δ−2LL   Δdf   p
Reading comprehension   1       6766.20   12   6790   6847
                        2       6766.21   11   6788   6840
                        3       6766.85   9    6784   6827   .65     2     .723^a
                        4       6578.35   18   6614   6698
                        5       6578.35   17   6612   6692
                        6       6578.36   16   6610   6686
                        7       6578.39   18   6612   6692
                        8       6582.57   12   6606   6662   4.18    6     .652^b
Word reading            1       6650.71   12   6675   6731
                        2       6650.22   11   6672   6724
                        3       6650.56   9    6673   6716   .30     2     .861^a
                        4       6364.37   18   6400   6485
                        5       6364.36   17   6398   6478
                        6       6364.62   16   6397   6472
                        7       6364.37   17   6398   6478
                        8       6366.09   12   6390   6446   1.72    5     .886^b
Vocabulary              1       6283.66   12   6308   6363
                        2       6283.65   11   6305   6356
                        3       6284.52   9    6303   6344   .87     2     .647^a
                        4       6933.29   18   6969   7054
                        5       6933.29   17   6967   7047
                        6       6933.47   16   6965   7041
                        7       6933.29   17   6967   7047
                        8       6934.07   12   6958   7014   .78     5     .978^b

Note. −2LL = −2 × log likelihood; AIC = Akaike information criterion;
BIC = Bayesian information criterion. Model 1 = treatment-comparison,
pretest-posttest freed loadings and intercepts; Model 2 =
treatment-comparison, pretest-posttest, invariant loadings, freed
intercepts; Model 3 = treatment-comparison, pretest-posttest,
invariant loadings and intercepts; Model 4 = treatment-comparison-small
group freed loadings and intercepts; Model 5 = treatment-comparison
invariant loadings, freed intercepts; Model 6 = treatment-comparison
invariant loadings and intercepts; Model 7 = treatment-small group
invariant loadings, freed variances; Model 8 = treatment-small
group-comparison invariant loadings, intercepts, means, and variances.
^a Model is compared with Model 2. ^b Model is compared with Model 7.

Tests  of  Invariance 

Results from the tests of invariance are presented in Table 4. For
the first phase of invariance testing, which concerned longitudinal
invariance between pretest and posttest in the treatment and
comparison groups, results consistently demonstrated that
imposing incremental equality constraints on the intercepts and
loadings did not significantly degrade fit. This step is important
because it suggests that forcing the intercepts and loadings to be
equal across groups did not distort the model. For reading
comprehension, the difference in deviance between Models 2 and 3
was negligible (Δ−2LL = 0.65) and not statistically significant
(p = .723). Similarly, no significant differences were observed
between Models 2 and 3 for word reading (Δ−2LL = 0.30, p = .861) or
vocabulary (Δ−2LL = 0.87, p = .647). Phase 2 testing of posttest
invariance among the treatment, comparison, and small group levels
(Models 4 through 8) showed no substantive difference in the
deviance statistic. In fact, the largest difference in deviance
between Model 4 (the least restrictive model) and Model 8 (the most
restrictive model) was for reading comprehension, where the deviance
difference was approximately 4 points with six degrees of freedom,
a nonsignificant finding. When comparing the final two models, no
significant differences were observed for reading comprehension
(Δ−2LL = 4.18, p = .652), word reading (Δ−2LL = 1.72, p = .886), or
vocabulary (Δ−2LL = 0.78, p = .978).

nSEM  Primary  Impact  Model  Results 

Primary impact model results for the three latent outcomes of
reading comprehension, word reading, and vocabulary, which address
the first research question, are presented in Figures 2 and 3. Using
a methodology similar to that used for comparing the factor analytic
models, the impact analyses tested constrained and freed versions
of the nSEM in Figure 1. In the constrained version of the model,
the latent posttest means for the Passport and comparison groups
(i.e., $\alpha^1_1$ and $\alpha^2_1$; Figure 1) were constrained to
be equal. This constraint was relaxed in a second model to test the
mean difference. A log-likelihood difference test was used for
hypothesis testing of model differences. The model comparison for
reading comprehension (see Table 5) showed that the model with freed
posttest means fit better than the model with constrained means
(Δ−2LL = 9.47, Δdf = 1, p = .002). Figure 2 shows that, controlling
for the pretest relation to posttest (β = 1.08), the standardized
mean posttest value was α = 1.26 for the Passport group and α = .88
for the comparison group, a statistically significant difference.
The effect size of Passport for latent reading comprehension
outcomes is calculated as the difference between these two values,
or 0.38. No significant differences were observed between the
constrained and freed posttest means models for latent word reading
(p = .280) or latent vocabulary (p = .480). Further, no substantive
primary impacts for Passport were observed for word reading
(Δα = .06; Figure 3, top), nor was there an impact on vocabulary
(Δα = .08; Figure 3, bottom).

Figure 2. Primary impact n-level structural equation models for
partial nested randomized controlled trial for reading comprehension.
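
Written out, the reading comprehension effect size is simply the difference between the two standardized latent posttest means reported for Figure 2. The superscripts below follow the level numbering used for Figure 1 (1 for Passport, 2 for comparison), which is an assumption about how the notation maps onto these estimates.

```latex
\Delta\alpha \;=\; \alpha^{1}_{1} - \alpha^{2}_{1} \;=\; 1.26 - 0.88 \;=\; 0.38
```

Because the pretest latent variable is scaled to a mean of 0 and a variance of 1, this difference is read directly in standard deviation units.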

nSEM  Exploratory  Modeling  Results 

To address the second research question, exploratory analyses
evaluated the moderation of treatment effects based on EL status
and selected baseline measures (i.e., pretest, letter-word
identification, and oral reading fluency). As previously noted, two
methods are frequently employed to test for treatment effects in SEM
studies: including k − 1 dummy codes or using multiple groups. In a
similar manner, moderation of treatment effects can be tested by
including interaction terms in a regression model or by using the
multiple group method. The moderators for our exploratory analyses
were a combination of continuous variables (i.e., baseline/pretest,
letter-word identification, and oral reading fluency) and a
categorical variable (i.e., EL status). As such, two different
approaches were used for tests of moderation.

Three baseline moderation models were tested. The first moderation
model, which we call the baseline moderation model, tested the
impact of the autoregressive, latent pretest construct and
whether the relation between latent pretest and posttest varied by
group. Releasing β in Figure 1 to be freely estimated for the
Passport and comparison groups, and comparing the fit of this
model with that of the primary impact model, in which β is
constrained to be the same between the two groups, provides a test
of whether baseline performance moderates the treatment effect. The
second and third moderation models, which used single-item
indicators of letter-word identification and ORF, respectively, were
specified by first creating a single-indicator latent construct
for the moderator of interest (i.e., the loading was fixed at
1.0 and the residual variance was set at a reliability-adjusted
estimate of the sample variance). This factor was set as a predictor
of the latent posttest, in the same way as the β parameter in
Figure 1, and was also set to covary with the latent pretest for
both Passport and comparison groups. Estimation for this type of
model required two steps. First, the path from the baseline measure
was constrained to be equal between Passport and comparison groups.
Second, fit from this model was compared with that of a model in
which the constraint was freed for estimation. Improved fit for the
freed model provided evidence for moderation.

Figure 3. Primary impact n-level structural equation models for
partial nested randomized controlled trial for word reading (top)
and vocabulary (bottom).
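
The single-indicator constructs described for the letter-word identification and ORF moderators fix the loading at 1.0 and fix the residual variance at a reliability-adjusted share of the sample variance. A common convention for that adjustment, assumed here because the text does not give the formula, is residual variance = (1 − reliability) × sample variance. The reliability value in the sketch is hypothetical; only the standard deviation is taken from Table 2.

```python
def single_indicator_residual_variance(sample_variance: float,
                                       reliability: float) -> float:
    """Residual variance to fix for a single-indicator latent variable.

    Assumes the conventional adjustment (1 - reliability) * variance,
    with the loading fixed at 1.0.
    """
    if sample_variance < 0:
        raise ValueError("sample_variance must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    return (1.0 - reliability) * sample_variance


if __name__ == "__main__":
    fall_lwid_sd = 18.97          # Fall WJ LWID SD as reported in Table 2
    assumed_reliability = 0.90    # hypothetical value, for illustration only
    theta = single_indicator_residual_variance(fall_lwid_sd ** 2,
                                               assumed_reliability)
    print(f"fixed residual variance: {theta:.2f}")
```

Fixing the residual this way lets the moderator enter the model as a latent variable whose variance excludes the assumed measurement error.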

Results for the three tests of moderation for each outcome are
reported in Table 5. For latent reading comprehension, no moderation
was observed for baseline latent reading comprehension
(Δ−2LL = 0.01, p = .920) or baseline oral reading fluency
(Δ−2LL = 1.00, p = .321), but statistically significant moderation
was estimated for baseline letter-word identification
(Δ−2LL = 14.87, p < .001), such that students with higher initial
word reading scores performed better on reading comprehension in the
treatment. No significant moderation was observed for any of the
selected moderators for either the latent word reading or the latent
vocabulary outcome (see Table 5).

For the EL indicator, moderation was tested by fitting the
factor models from Figure 1 separately for EL and non-EL
students and evaluating Passport and comparison group posttest
mean differences using constrained and freed posttest means,
similar to the primary impact model. Results for the EL student
model (Table 5 and Figure 4) showed no statistically significant
differences in posttest means for reading comprehension (p = .068),
word reading (p = .108), or vocabulary (p = .841); however, the
mean effect size differences in Figure 4 show small effects in favor
of Passport for latent word reading (Δα = 0.54 − 0.35 = 0.19) and
latent reading comprehension (Δα = 1.42 − 1.04 = 0.38). No effect
of Passport was observed for EL students on latent vocabulary
(Δα = .01). A statistically significant effect of Passport was
estimated for non-EL students on reading comprehension (p = .009;
Table 5), with an effect size of Δα = .39 (see Figure 5).
No significant effects were estimated for latent word reading
(p = .729) or vocabulary (p = .362); however, unlike in the other
analyses, a small effect on vocabulary was estimated (Δα = .13;
Figure 5).

Table 5

Fit Comparison for Primary Impact Models and Moderation with EL,
Baseline, Letter-Word Identification, and Oral Reading Fluency

Outcome / Type          Model         −2LL       df   AIC     BIC     Δ−2LL   Δdf   p
Reading comprehension
  Impact                Constrained   13167.32   16   13199   13285
                        Freed         13157.85   17   13192   13284   9.47    1     .002
  EL moderation         Constrained   3466.47    16   3498    3563
                        Freed         3463.15    17   3497    3566    3.32    1     .068
  Non-EL moderation     Constrained   9654.43    16   9686    9768
                        Freed         9647.69    17   9682    9768    6.74    1     .009
  Baseline moderation   Constrained   13164.84   16   13197   13283
                        Freed         13164.83   17   13198   13290   .01     1     .920
  LWID moderation       Constrained   16545.13   24   16593   16728
                        Freed         16560.00   25   18610   18750   14.87   1     .000
  ORF moderation        Constrained   16875.00   24   16923   17058
                        Freed         16874.00   25   16923   17064   1.00    1     .320
Word reading
  Impact                Constrained   12485.20   16   12517   12603
                        Freed         12484.05   17   12518   12609   1.15    1     .284
  EL moderation         Constrained   3323.66    16   3356    3421
                        Freed         3321.07    17   3355    3424    2.59    1     .108
  Non-EL moderation     Constrained   9124.90    16   9157    9239
                        Freed         9124.78    17   9159    9245    .12     1     .729
  Baseline moderation   Constrained   12486.67   16   12519   12605
                        Freed         12486.65   17   12521   12612   .02     1     .888
  LWID moderation       Constrained
                        Freed
  ORF moderation        Constrained   16032.00   24   16080   16215
                        Freed         16031.00   25   16081   16222   1.00    1     .320
Vocabulary
  Impact                Constrained   12826.17   16   12858   12943
                        Freed         12825.67   17   12859   12950   .50     1     .480
  EL moderation         Constrained   3025.1     16   3057    3119
                        Freed         3025.06    17   3059    3125    .04     1     .841
  Non-EL moderation     Constrained   9679.59    16   9712    9793
                        Freed         9678.76    17   9713    9799    .83     1     .362
  Baseline moderation   Constrained   12825.66   16   12858   12943
                        Freed         12825.15   17   12859   12950   .51     1     .480
  LWID moderation       Constrained   16237      24   16285   16418
                        Freed         16235      25   16285   16424   2.00    1     .157
  ORF moderation        Constrained   16550      24   16598   16732
                        Freed         16549      25   16600   16739   1.00    1     .320

Note. −2LL = −2 × log likelihood; AIC = Akaike information criterion;
BIC = Bayesian information criterion; EL = English learner; LWID =
letter word identification; ORF = oral reading fluency. Δ−2LL, Δdf,
and p refer to the comparison of each freed model with its
constrained counterpart. LWID moderation was not tested for the
latent word reading outcome as it was part of the latent variable
itself and included in the pretest construct.


Discussion 

In this study, our aim was to contribute to the relatively limited
body of research on effective comprehensive reading interventions
to improve reading comprehension for upper elementary students
by extending our prior work examining the effects of a widely
used, multicomponent, upper elementary reading intervention. The
present study adds uniquely to the existing literature by employing
a large sample, using latent variables based on standardized
reading measures, and using a relatively more sophisticated data
analytic method (nSEM) to address differences in nesting within
the treatment and comparison groups. In addition, the larger
sample allowed us to examine additional moderators, such as
initial baseline reading performance and EL status, to learn more
about for whom the intervention was most effective. The treatment,
which included approximately 94 sessions, was implemented with a
high degree of fidelity. Thus, the study is not only rigorous in
design but also one of the most extensive to date for this grade
level, providing a fairly optimal test of the possible effects of
implementing this multicomponent intervention at the fourth-grade
level.

Figure 4. Exploratory n-level structural equation modeling (SEM) for
English Learners on word reading (top) and reading comprehension
(bottom).

Our first research question addressed main effects of the
multicomponent intervention on reading comprehension, word
reading, and vocabulary. Consistent with our hypothesis that
students with reading difficulties receiving the intervention would
outperform students receiving only typical school services in
reading comprehension, we found a significant effect of the
intervention on reading comprehension with an effect size of 0.38.
However, we found no significant effects on word reading
(ES = 0.05) or on vocabulary (ES = 0.08). The magnitude of the
effect on comprehension is slightly larger than in our previous
study of the Passport to Literacy intervention, which found effect
sizes on the individual measures that comprised our latent variable
in the present study (i.e., WJIII passage comprehension [ES = 0.14]
and the GMRT [ES = 0.28]). It is noteworthy that 0.38 exceeds the
effect size criterion of 0.25 for substantively important impact
from the What Works Clearinghouse (2014). On the basis of the mean
standard scores, students in the comparison group appeared to
make expected progress (1 year's worth of progress) in reading
comprehension, whereas students in the treatment group accelerated
their learning. In other words, students in the comparison
group did not fall any further behind, whereas students in the
treatment group made some progress toward closing the gap between
their current level of performance and expected levels of
performance in reading comprehension. Importantly, neither group of
students demonstrated on-grade-level performance at the end of the
intervention, although the accelerated learning in reading
comprehension for students in the treatment group is promising. We
found no significant differences between study groups on word
reading or vocabulary. Thus, our findings suggest participation in
Passport to Literacy can improve student reading comprehension,
which is a finding consistent with our initial work (Wanzek et al.,
2016).

Figure 5. Exploratory n-level structural equation modeling (SEM) for
non-English Learners on vocabulary (top) and reading comprehension
(bottom).

That we found no main effects for word reading or vocabulary
is important, particularly as it is consistent with our prior study
and suggests that, for students with weak comprehension,
participating in Passport to Literacy would likely move the dial
only on reading comprehension. This is likely because, although the
program is multicomponent, it focuses primarily on reading
comprehension, with relatively limited word work or in-depth
vocabulary instruction. Our observations indicated that, as
designed, on average more than 40% of the treatment intervention was
devoted to explicit instruction in reading comprehension. In
contrast, the percentages of implemented intervention devoted to
vocabulary, text reading, decoding, and spelling were 21%, 17%, 12%,
and 5%, respectively. The quality of this instruction was fairly
high as well, indicating students received explicit, systematic
instruction in reading comprehension. This high-quality
comprehension emphasis in the intervention may explain the reading
comprehension outcomes students realized. In other words, the fact
that Passport to Literacy has its benefits largely in the area of
reading comprehension may be related to the focus of the
intervention. The effect sizes for reading comprehension in the
present study are larger than those in our prior study (effect sizes
ranged from 0.14 to 0.28 in the prior study) but are smaller than
effect sizes reported for two other multicomponent interventions.
Specifically, for reading comprehension measures, Vadasy and Sanders
(2008) reported an effect size of 0.50, and O'Connor et al.'s (2002)
effect sizes ranged from 1.39 to 1.46. By contrast, Ritchey et al.
(2012) found no significant differences on a standardized measure of
reading comprehension but did report an effect size of 0.56 on a
researcher-developed measure of comprehension strategy use.

In our previous study of the effects of Passport to Literacy with
a smaller sample, we suggested that our pattern of effects
(significant effects for reading comprehension, but not for word
reading or vocabulary) might be related to the amount of time
devoted to narrative and expository comprehension and word reading
during the lessons, with an average of 12 min of reading
comprehension instruction and 6 min of vocabulary instruction in a
typical half-hour lesson, compared with 3 min of decoding or word
reading instruction. In contrast, the interventions in the O'Connor
et al. (2002) and Vadasy and Sanders (2008) studies included
relatively more fluency practice than in the current study, perhaps
allowing students to access greater amounts of text for improving
their overall reading comprehension. The samples in the studies by
O'Connor et al. (2002) and Vadasy and Sanders (2008) also presented
with lower overall word recognition and fluency abilities initially.
Ritchey et al. (2012) emphasized fluency and expository
comprehension, but for a briefer period of time (24 sessions) than
O'Connor et al. (2002), Vadasy and Sanders (2008), or the current
study. The brief time period makes it difficult to directly compare
the relationship between instruction and findings in that study with
those of the other, lengthier studies. However, the current
findings seem to align with the differences in intervention focus,
length of intervention, and results of the previous studies.


Our  second  research  question  addressed  moderation,  to  help 
inform  for  whom  the  intervention  was  effective.  We  hypothesized, 
based  on  exploratory  findings  from  our  previous  study,  that  stu¬ 
dents  with  low  levels  of  initial  comprehension  might  demonstrate 
less  growth  than  students  with  better  initial  comprehension.  How¬ 
ever,  with  our  larger  sample  and  using  latent  variables,  we  found 
no  moderation  effects  for  initial  status  on  comprehension,  suggest¬ 
ing  the  intervention  was  equally  beneficial  for  students  at  all  levels 
of  initial  comprehension.  This  is  encouraging  for  practice  as  the 
intervention,  with  its  relative  emphasis  on  comprehension,  can 
assist  all  levels  of  struggling,  upper  elementary  students  in  im¬ 
proving  their  reading  comprehension.  There  was  also  no  modera¬ 
tion  of  the  intervention  effects  for  reading  comprehension  on  the 
basis  of  students’  initial  reading  fluency,  a  finding  that  aligns  with 
O’Connor  et  al.  (2002),  though  O’Connor  et  al.  categorized  stu¬ 
dents  into  lower  or  higher  fluency  students  based  on  a  break  point. 
We  examined  moderation  of  oral  reading  fluency  differences  as  a 
continuous  variable.  The  intervention  was  equally  beneficial  in 
improving  reading  comprehension  for  students  at  all  levels  of 
initial  reading  fluency.  However,  we  did  find  that  initial  individual 
differences  in  word  reading  ability  significantly  moderated  the 
effect  of  the  treatment,  with  students  who  entered  the  intervention 
at  lower  levels  of  word  recognition  making  less  progress  in  reading 
comprehension  than  students  who  entered  the  intervention  with 
higher  levels  of  word  reading.  An  implication  for  schools  is  that 
these  students  with  low  word  reading  may  require  a  reading 
intervention  that  incorporates  more  word  study  before  they  can
fully  benefit  from  an  intervention  that  emphasizes  reading  com¬ 
prehension.  The  relatively  brief  intensive  word  study  provided  at 
the  beginning  of  the  Passport  to  Literacy  intervention  may  not  be 
enough  for  students  with  low  word  recognition  to  make  the  same 
gains  as  those  entering  with  higher  levels  of  word  recognition. 
Torgesen  et  al.  (2001)  implemented  an  intensive  reading  interven¬ 
tion  largely  focused  on  word  recognition  for  students  with  very 
low  initial  word  reading  skills  and  reported  significant  gains  in 
standard  scores  across  word  reading  and  reading  comprehension. 
The  lack  of  control  group  in  the  Torgesen  et  al.  study  makes  it 
difficult  to  compare  effect  sizes  with  other  studies,  but  an  intensive 
intervention  with  a  heavier  emphasis  on  word  recognition  is  likely 
needed  for  students  with  the  lowest  word  recognition  abilities  at 
the  upper  elementary  level.  To  summarize,  the  Passport  to  Literacy 
intervention  provided  improvements  in  students’  reading  compre¬ 
hension  beyond  the  typical  school  services  for  students  at  varying 
levels  of  initial  reading  comprehension  or  reading  fluency  but  who 
had  relatively  higher  levels  of  word  reading  ability. 

Encouragingly,  the  effects  of  the  intervention  on  reading  com¬ 
prehension  were  similar  for  EL  and  non-EL  students  (ES  =  0.38 
and  0.39,  respectively),  suggesting  the  intervention  is  equally 
beneficial  and  appropriate  for  ELs  to  improve  their  reading  and 
understanding  of  English  text.  Practical  benefits  of  the  intervention 
were  noted  in  relation  to  word  reading  for  the  EL  students,  but  this 
was  not  a  significant  moderation.  Previous  work  reviewed  by 
Baker  et  al.  (2014)  demonstrated  that  both  younger  ELs  (kinder¬ 
garten  through  Grade  1)  and  older  ELs  (Grades  6  through  8) 
benefit  from  small  group  multicomponent  reading  interventions  in 
terms  of  word  reading  and  comprehension.  Wanzek  and  Roberts 
(2012)  also  noted  EL  status  moderated  effects  on  word  attack  and 
word  identification  with  the  EL  students  performing  better  than 
non-EL  students  following  intervention.  These  higher  effects  occurred
regardless  of  the  emphasis  of  the  intervention  (e.g.,  comprehension
emphasis,  word  recognition  emphasis).

Limitations 

Although  our  study  was  rigorous,  there  are  always  limitations 
involved  with  school-based  research.  To  ensure  a  strong  test  of  the 
efficacy  of  the  Passport  to  Literacy  intervention,  we  trained  re¬ 
search  staff  to  implement  the  intervention  with  a  high  degree  of 
fidelity  and  dosage  consistent  with  the  publisher’s  recommenda¬ 
tions.  Thus,  similar  effects  may  or  may  not  be  achieved  by  school 
personnel  depending  on  implementation.  We  also  recruited  schools 
that  were  diverse  and  served  students  from  low  socioeconomic 
backgrounds,  so  our  findings  might  not  generalize  to  schools 
serving  students  from  higher  socioeconomic  backgrounds.  The 
majority  of  the  ELs  in  our  study  were  Hispanic,  and  our  findings
may  not  generalize  to  students  from  other  language  backgrounds,
particularly  those  with  orthographies  that  are  very  different  from
English.  Further,  effect  sizes  are  interpretable  relative  to  the  com¬ 
parison  condition  in  the  participating  schools  where  very  few 
struggling  readers  received  supplemental  interventions  as  a  part  of 
their  typical  practice. 

Implications  and  Directions  for  Future  Research 

Teachers  and  school  leaders  face  challenges  in  identifying  ef¬ 
fective  reading  interventions  for  students  in  the  upper  elementary 
grades,  particularly  given  the  high  numbers  of  students  who  con¬ 
tinue  to  struggle  with  reading  after  third  grade  (National  Center  for 
Educational  Statistics,  2016).  The  increased  demands  placed  on 
students  beginning  in  fourth  grade  may  cause  a  slowing  of  actual 
versus  expected  growth  for  some  students  (Chall  &  Jacobs,  1983). 
Therefore,  fourth  grade  teachers  are  often  faced  with  the  challenge 
of  providing  intervention  not  only  for  students  with  previously 
identified  reading  difficulties  that  have  not  been  adequately  reme¬ 
diated,  but  also  students  with  late-emerging  reading  difficulties 
(Compton,  Fuchs,  Fuchs,  Elleman,  &  Gilbert,  2008). 

The  current  study  suggests  that  a  multicomponent  intervention 
emphasizing  comprehension  instruction  can  allow  students  to  ac¬ 
celerate  their  reading  comprehension  outcomes.  Without  such  in¬ 
terventions,  particularly  given  the  limited  emphasis  within  core 
classroom  instruction  to  support  learning  to  read  in  fourth  grade, 
students  who  do  not  read  proficiently  could  face  serious  and 
ongoing  consequences,  not  only  in  reading  language  arts,  but  also 
across  content  areas. 

On  the  one  hand,  the  positive  effects  for  reading  comprehension 
found  in  our  study  extend  the  limited  evidence  base  on  effective 
multicomponent  reading  interventions  for  upper  elementary  stu¬ 
dents.  On  the  other  hand,  the  lack  of  effects  for  word  reading  or 
vocabulary  underscores  the  need  for  more  research  on  intensive 
interventions  for  fourth  grade  students  with  the  most  severe  read¬ 
ing  difficulties.  For  example,  there  is  an  even  more  intensive  level 
of  the  Passport  to  Literacy  intervention,  which  the  publisher  rec¬ 
ommends  for  students  in  need  of  more  intensive  levels  of  instruc¬ 
tion.  It  is  more  intensive  in  that  students  are  served  in  smaller 
groups  and  for  a  longer  session  and  includes  additional  instruction, 
including  instruction  in  reading  fluency  that  has  been  more  emphasized
in  previous  work  (O’Connor  et  al.,  2002;  Vadasy  &
Sanders,  2008).  It  is  possible  that  this  extended  intervention  will  be
more  potent  than  the  standard  implementation  of  the  Passport  to
Literacy  intervention,  providing  the  additional  emphasis  without 
decreasing  the  time  spent  on  comprehension.  To  guide  schools’
intervention  implementation  for  the  upper  elementary  grades,  additional
research  is  needed  to  identify  appropriate  interventions,
describe  for  whom  they  are  effective,  and  examine  the  relative
benefits  of  interventions  with  increasing  intensity  to  adequately
meet  the  varying  needs  of  students.

The  present  study  explored  the  bidirectional  and  longitudinal  associations  between  executive  function
(EF)  and  early  academic  skills  (math  and  literacy)  across  4  waves  of  measurement  during  the  transition 
from  preschool  to  kindergarten  using  2  complementary  analytical  approaches:  cross-lagged  panel 
modeling  and  latent  growth  curve  modeling  (LGCM).  Participants  included  424  children  (49%  female).
On  average,  children  were  approximately  4.5  years  old  at  the  beginning  of  the  study  (M  =  4.69,  SD  =
.30)  and  55%  were  enrolled  in  Head  Start.  Cross-lagged  panel  models  indicated  bidirectional  relations 
between  EF  and  math  over  preschool,  which  became  directional  in  kindergarten  with  only  EF  predicting 
math.  Moreover,  there  was  a  bidirectional  relation  between  math  and  literacy  that  emerged  in  kinder¬ 
garten.  Similarly,  LGCM  revealed  correlated  growth  between  EF  and  math  as  well  as  math  and  literacy, 
but  not  EF  and  literacy.  Exploring  the  patterns  of  relations  across  the  waves  of  the  panel  model  in 
conjunction  with  the  patterns  of  relations  between  intercepts  and  slopes  in  the  LGCMs  led  to  a  more 
nuanced  understanding  of  the  relations  between  EF  and  academic  skills  across  preschool  and  kinder¬ 
garten.  Implications  for  future  research  on  instruction  and  intervention  development  are  discussed. 

Keywords:  executive  function,  mathematics,  literacy,  preschool,  kindergarten 


Over  the  past  decade,  there  has  been  increased  focus  on  chil¬ 
dren’s  executive  function  (EF) — specifically  on  its  development 
and  how  it  relates  to  other  school  readiness  domains.  One  reason 
for  this  surge  of  interest  is  that  EF  in  early  childhood  has  been 
connected  to  a  range  of  critical  developmental  outcomes,  including 
physical  health,  social-emotional  well-being,  and  occupational  at¬ 
tainment  in  adulthood  (Moffitt  et  al.,  2011).  Of  particular  interest 
are  the  significant  and  direct  relations  found  between  early  EF  and 
academic  achievement.  Findings  from  a  number  of  studies  indicate 
that  individual  differences  in  EF  measured  in  early  childhood 
predict  concurrent  and  long-term  math  and  literacy  achievement 
(Duckworth,  Tsukayama,  &  May,  2010;  Fuhs,  Nesbitt,  Farran,  & 
Dong,  2014;  McClelland,  Acock,  Piccinin,  Rhea,  &  Stallings, 
2013)  as  well  as  growth  in  children’s  higher-level  reasoning  strategies
(Richland  &  Burchinal,  2013).

Although  the  predictive  link  between  EF  and  early  achievement 
is  established,  it  is  less  clear  whether  early  academic  skills  also 
predict  the  development  of  EF.  Recent  evidence  indicates  that 
there  may  be  a  bidirectional  association  between  EF  and  academic 
skills,  particularly  for  math  (Fuhs  et  al.,  2014;  Welsh,  Nix,  Blair, 
Bierman,  &  Nelson,  2010).  However,  these  studies  were  limited  to 
just  three  time  points  over  the  course  of  the  preschool  and  kinder¬ 
garten  years  and  by  the  analytic  approach  employed  (i.e.,  only 
panel  models  were  used).  Further,  it  is  unclear  whether  growth 
trajectories  in  EF  are  related  to  growth  trajectories  in  other  do¬ 
mains  (e.g.,  math).  The  overarching  goal  of  the  current  study  was 
therefore  to  examine  the  longitudinal  relations  between  EF  and 
academic  skills  across  four  waves  of  measurement  during  the 
transition  from  preschool  to  kindergarten.  We  had  two  specific 
aims.  First,  we  investigated  the  bidirectional  relations  between  EF 
and  academic  skills  (math  and  literacy)  through  a  longitudinal 
panel  model  that  tested  whether  relative  standing  on  the  domains 
at  each  time  point  was  related  to  changes  in  relative  standing  on
the  other  domains.  Second,  we  examined  relations  between  growth 
in  EF,  math,  and  literacy  using  latent  growth  curve  models 
(LGCM)  that  tested  whether  the  rate  of  absolute  change  across  all 
time  points  on  the  domains  was  correlated.  The  two  models  pro¬ 
vide  unique  information  by  identifying  when  early  skills  are  most 
related  to  subsequent  skill  development  (i.e.,  panel  models),  and  to 
what  extent  children’s  overall  growth  on  skills  during  this  devel¬ 
opmental  period  are  related  (i.e.,  LGCM). 

Importance  of  EF  for  Academic  Achievement

EF  emerges  early  in  life  and  develops  across  the  life  span; 
however,  structural  changes  in  the  prefrontal  cortex  between  ages 
two  and  five  allow  for  dramatic  increases  in  EF  skills  during  early 
childhood  (Zelazo  &  Ulrich,  2011).  Evidence  suggests  that  EF 
involves  three  related,  yet  distinct,  cognitive  processes  (Miyake  et
al.,  2000):  working  memory  (holding  information  in  mind  while
processing  other  information;  Gathercole,  Pickering,  Knight,  & 
Stegmann,  2004),  inhibitory  control  (overriding  a  dominant  re¬ 
sponse;  Dowsett  &  Livesey,  2000),  and  cognitive  flexibility  or 
attention  shifting  (maintaining  focus  and  flexibly  adapting  to 
changing  goals;  Rueda,  Posner,  &  Rothbart,  2005).  When  children 
enter  kindergarten,  they  must  adapt  to  new,  more  formal,  and 
structured  educational  contexts  that  may  require  greater  EF  to 
navigate,  compared  to  the  less  formal  and  structured  educational 
environments  experienced  earlier. 

The  transition  from  preschool  to  kindergarten  is  not  only  an 
important  developmental  period  for  EF,  it  is  also  a  time  when  early 
academic  skills  develop  rapidly.  Similar  to  EF,  a  substantial  body 
of  research  highlights  the  importance  of  the  preschool  years  for  the 
development  of  early  literacy  (e.g.,  National  Early  Literacy  Panel, 
2008;  Whitehurst  &  Lonigan,  1998)  and  math  skills  (e.g.,  National 
Mathematics  Advisory  Panel,  2008),  and  it  is  well  known  that 
early  academic  skills  are  precursors  to  later  academic  success  (e.g., 
La  Paro  &  Pianta,  2000;  Stevenson  &  Newman,  1986).  Further¬ 
more,  evidence  supports  an  association  between  early  math  and 
reading  (Duncan  et  al.,  2007;  LeFevre  et  al.,  2010;  Purpura,  Hume, 
Sims,  &  Lonigan,  2011).  These  two  academic  domains  are  related 
over  time  and  children  who  demonstrate  difficulties  in  one  area  are 
at  elevated  risk  for  having  difficulties  in  the  other  (Barbarisi, 
Katusic,  Colligan,  Weaver,  &  Jacobsen,  2005).  Theory  and  re¬ 
search  suggest  that  aspects  of  literacy  may  be  foundational  for 
math  development.  Children  may  need  to  draw  upon  vocabulary 
skills  to  learn  number  words  and  complete  math  tasks  that  are 
inherently  language  based  (LeFevre  et  al.,  2010;  Purpura  et  al., 
2011).  Although  EF,  early  math,  and  emergent  literacy  appear  to 
develop  during  the  same  time  frame,  some  scholars  argue  that  EF 
is  foundational  for  academic  achievement  (Blair  &  Raver,  2015; 
McClelland  et  al.,  2007).  Furthermore,  children’s  EF  is  related  to 
both  their  own  and  their  peers’  acquisition  of  academic  skills 
(Skibbe,  Phillips,  Day,  Brophy-Herb,  &  Connor,  2012).  For  ex¬ 
ample,  Skibbe  and  colleagues  (2012)  found  that  children  demon¬ 
strated  greater  gains  in  literacy  skills  during  the  academic  year 
when  they  were  part  of  classrooms  where  their  classmates  had 
higher  levels  of  EF. 

Theoretical  and  empirical  perspectives  support  the  connection 
between  EF  and  math  and  literacy  skills.  For  children  to  take 
advantage  of  learning  opportunities  in  classroom  contexts,  they 
must  be  able  to  pay  attention,  persist  on  challenging  tasks,  and 
avoid  distractions  (Blair  &  Raver,  2015;  McClelland,  Geldhof, 
Cameron,  &  Wanless,  2015).  Specifically,  strong  EF  may  be 
critical  for  aspects  of  early  math  development  such  as  cardinality 
or  formal  addition,  which  require  children  to  flexibly  shift  atten¬ 
tion  from  procedural  to  more  conceptual  problem  elements  and 
inhibit  previously  learned  rules.  Similarly,  EF  may  be  needed  for 
growth  in  emergent  literacy  skills,  such  as  phonological  aware¬ 
ness,  because  children  must  have  the  ability  to  hold  letter  sounds 
in  mind  and  switch  between  combining  and  separating  sounds  and
words.  Studies  suggest  there  is  a  predictive  relation  between  EF
and  math  and  literacy  achievement  in  diverse  samples  of  young 
children,  even  after  controlling  for  relevant  sociodemographic 
factors  (e.g.,  maternal  education,  child  IQ)  and  initial  achievement 
scores  (Bull,  Espy,  &  Wiebe,  2008;  Duncan  et  al.,  2007;  McClel¬ 
land,  Acock,  &  Morrison,  2006).  Findings  from  a  recent  study 
demonstrate  a  long-term  relation  between  EF  and  achievement, 
such  that  children  who  were  rated  higher  on  aspects  of  EF  (e.g., 
attention  and  persistence)  during  preschool  were  more  likely  to 
complete  college  (McClelland  et  al.,  2013).  Even  among  children 
with  academic  difficulties  (i.e.,  those  who  experienced  grade  re¬ 
tention),  EF  appears  to  play  a  role  in  subsequent  math  and  reading 
growth.  For  example,  Chen,  Hughes,  and  Kwok  (2014)  found  that, 
among  children  who  had  been  held  back  a  grade,  those  who 
exhibited  patterns  of  more  rapid  academic  growth  displayed  higher 
EF  skills. 

Although  prior  evidence  suggests  EF  is  associated  with  both 
math  and  literacy  in  early  childhood,  the  concurrent  and  predictive 
relation  between  EF  and  math  seems  to  be  stronger  than  the 
relation  between  EF  and  literacy  in  young  children  (Blair  &  Razza, 
2007;  Blair,  Ursache,  Greenberg,  &  Vernon-Feagans,  2015;  Cam¬ 
eron  Ponitz,  McClelland,  Matthews,  &  Morrison,  2009;  Schmitt, 
Pratt,  &  McClelland,  2014).  Furthermore,  EF  skills  may  mediate 
the  development  of  math  skills  across  the  early  elementary  years, 
but  not  the  development  of  literacy  skills  (Hassinger-Das,  Jordan, 
Glutting,  Irwin,  &  Dyson,  2014).  Several  interpretations  explain¬ 
ing  these  differential  associations  have  been  introduced  in  recent 
literature.  For  example,  one  interpretation  is  that  math  content 
places  more  cognitive  demands  on  children  than  does  literacy 
content.  Math  skills,  therefore,  may  require  stronger  EF  skills  to 
acquire  (Bull  et  al.,  2008;  Clark,  Pritchard,  &  Woodward,  2010; 
Espy  et  al.,  2004;  Willoughby,  Blair,  Wirth,  &  Greenberg,  2012). 
Evidence  from  the  neuroscience  literature  also  indicates  an  overlap 
between  the  brain  regions  that  support  EF  and  math  development 
(Klingberg,  2006),  suggesting  that  growth  in  EF  may  strongly 
facilitate  growth  in  math  while  having  a  weaker  influence  on  the 
development  of  literacy.  A  second  interpretation  is  that  this  rela¬ 
tion  results  from  instructional  content  (or  lack  thereof)  provided  in 
early  childhood  classrooms  (Fuhs  et  al.,  2014).  Preschool  teachers 
spend  significantly  more  time  engaged  in  direct  literacy  instruction 
than  in  math  instruction,  suggesting  that  children  may  need  to  seek 
out  their  own  independent  math  activities  which  may  be  influenced 
by  their  EF  skills.  For  example,  children  who  have  stronger  levels 
of  EF  may  choose  more  complex  and  difficult  math  activities 
during  free  play  (or  may  be  directed  to  by  teachers  and  parents) 
because  they  may  be  more  cognitively  ready  to  do  so.  A  third 
interpretation  is  that  EF  provides  a  foundation  for  the  development 
of  reasoning  abilities  or  fluid  mental  capacities  (e.g.,  problem 
solving),  which  are  typically  required  to  do  well  on  many  math 
assessments  (Blair  et  al.,  2015).  In  contrast,  many  literacy  assess¬ 
ments  are  more  knowledge-based,  making  stronger  demands  on 
crystallized  mental  abilities  (e.g.,  vocabulary)  and  fewer  demands 
on  EF  and  fluid  mental  abilities. 

Bidirectional  Relations  Between  EF  and 
Academic  Skills 

Although  EF  is  considered  by  some  to  be  foundational  for  the 
development  of  academic  skills,  recent  analyses  have  pointed  to  the
bidirectionality  between  EF  and  achievement  (Fuhs  et  al.,  2014;
Welsh  et  al.,  2010).  Indeed,  early  academic  skills  may  be 
important  for  the  development  of  EF,  just  as  EF  is  important  for 
the  development  of  early  academic  skills.  Although  the  ability 
to  pay  attention,  remember  complex  rules,  and  persist  on  chal¬ 
lenging  tasks  likely  helps  children  perform  better  academically 
(Blair  et  al.,  2007;  Blair  &  Raver,  2015),  strong  academic  skills 
may  also  contribute  to  children’s  ability  to  sustain  attention, 
remember  a  series  of  rules,  and  inhibit  incorrect  responses  on 
complex  tasks  (Fuhs  et  al.,  2014).  Engaging  in  a  complex  math 
activity,  for  example,  requires  children  to  identify  the  quantities 
of  multiple  sets,  retain  those  quantities  in  memory,  and  compare 
them. 

Recent  empirical  evidence  has  suggested  that  there  may  be  a 
bidirectional  relation  between  direct  assessments  of  EF  and 
academic  skills.  In  one  study  assessing  developmental  associ¬ 
ations  between  EF  and  academic  skills  during  the  prekinder¬ 
garten  year,  EF  at  the  beginning  of  the  year  predicted  gains  in 
math  and  literacy;  however,  math  at  the  beginning  of  prekin¬ 
dergarten  also  predicted  gains  in  EF  (Welsh  et  al.,  2010).  In  a 
second  study,  Fuhs  and  colleagues  (2014)  found  reciprocal 
associations  between  EF  and  math.  These  associations  were 
maintained  across  preschool,  and,  although  EF  continued  to 
predict  math  through  kindergarten,  the  predictive  relation  of 
math  on  EF  dissipated  between  the  end  of  preschool  and  end  of 
kindergarten.  However,  it  was  not  clear  when  during  this  year  the 
predictive  association  of  math  on  EF  faded.  Results  from  Fuhs  and 
colleagues  (2014)  also  indicated  a  reciprocal  relation  between  EF  and
oral  comprehension  skills  across  the  prekindergarten  year,  but  not  for 
other  literacy  skills.  These  findings  provide  initial  evidence  for  a 
bidirectional  relation  between  EF  and  early  achievement;  however, 
the  analyses  utilized  in  these  studies  were  limited  to  three  time  points 
(fall  and  spring  of  preschool  and  spring  of  kindergarten).  The  addition 
of  a  fourth  time  point  at  the  beginning  of  kindergarten  is  needed  to 
understand  these  relations  more  thoroughly;  significant  changes  in 
children’s  experiences  and  in  the  development  of  EF  and  academic 
skills  may  occur  between  the  spring  of  preschool  and  the  spring  of 
kindergarten.  Further,  the  addition  of  a  fourth  time  point  allows  us  to 
determine  when  early  EF,  math,  and  literacy  interventions  may  be 
most  beneficial  and  likely  to  facilitate  cross-domain  growth.  Identi¬ 
fying  more  specific  and  precise  times  at  which  these  relations  may 
change  could  have  applied  implications  as  children  enter  and  move 
through  kindergarten.  Moreover,  the  relations  between  these  variables 
at  different  ages  remain  unclear. 

Correlated  Growth  Between  EF  and  Academic  Skills 

In  addition  to  a  need  for  more  research  on  the  bidirectional 
relations  between  EF  and  academic  skills,  there  is  a  dearth  of
extant  literature  exploring  whether  the  rates  of  change  in  these 
domains  are  correlated  during  the  transition  to  kindergarten.  Un¬ 
derstanding  whether  growth  in  one  domain  is  related  to  growth  in 
another  has  theoretical  as  well  as  practical  implications  for  instruc¬ 
tion  and  intervention.  From  a  theoretical  standpoint,  exploring 
correlated  growth  across  domains  will  help  us  understand  the 
potential  that  improvements  in  one  domain  lead  to  improvements 
in  another  domain  or  that  other  individual  or  environmental  factors 
may  be  influencing  EF,  math,  and  literacy  development  similarly 
over  time  (Willoughby,  Kupersmidt,  &  Voegler-Lee,  2012),  rather
than  earlier  skills  in  and  of  themselves.  Some  have  suggested  that
the  relation  between  EF  and  math  may  be  attributable  to  other 
factors  such  as  IQ,  but  there  is  also  evidence  showing  that  EF  is 
separate  from  IQ  (e.g.,  Blair,  2006).  From  a  practical  standpoint,  if 
improvements  in  EF  are  associated  with  improvements  in  math, 
instruction  or  intervention  efforts  focused  on  EF  may  also  have  a 
beneficial  effect  on  children’s  math  development.  Likewise,  in¬ 
struction  or  intervention  efforts  focused  on  math  or  literacy  could 
have  beneficial  effects  on  children’s  EF  development.  For  exam¬ 
ple,  engaging  in  math  activities  may  not  only  support  the  devel¬ 
opment  of  math  concepts,  but  doing  so  may  also  allow  children  to 
practice  EF  skills  (e.g.,  attending  to  details,  remembering  instruc¬ 
tions).  Similarly,  retaining  details  of  a  story  in  memory  while 
simultaneously  attending  to  new  developments  in  the  plotline  to 
comprehend  the  broader  story  also  may  provide  children  an  op¬ 
portunity  to  practice  EF  skills. 

To  our  knowledge,  no  studies  to  date  have  examined  dual 
trajectory  latent  growth  curves  between  EF  and  academic  skills.  In 
one  related  study,  fixed  effects  models  were  used  to  explore 
whether  intraindividual  change  on  measures  of  EF  predicted  in¬ 
traindividual  change  in  math,  literacy,  and  vocabulary  during  the 
transition  to  kindergarten.  Results  indicated  that  growth  in  EF  on 
some,  but  not  all,  of  the  measures  predicted  growth  in  math,  and 
that  growth  on  one  measure  of  inhibitory  control  was  related  to 
vocabulary  development  (McClelland  et  al.,  2014).  The  current 
study  extends  these  analyses  by  using  LGCM.  Although  using 
fixed  effects  models  can  be  informative,  this  type  of  analysis  only 
explores  relations  between  intraindividual  changes  over  time  be¬ 
tween  the  domains.  In  the  current  study,  we  attempted  to  more 
accurately  measure  children’s  trajectories  using  LGCM,  which 
estimates  associations  across  domains  on  random  intercepts  and 
linear  and  quadratic  slopes.  Further,  LGCM  is  able  to  estimate  the 
associations  between  the  EF,  math,  and  literacy  slopes,  conditional 
on  differences  in  their  intercepts.  However,  the  LGCM  is  not  able 
to  determine  whether  one  domain  contributes  to  or  is  causally 
related  to  development  in  another,  or  whether  other  factors  simul¬ 
taneously  influence  multiple  domains  of  development  (e.g.,  high 
quality  early  math  instruction).  Once  correlated  growth  is  estab¬ 
lished,  follow-up  studies  would  be  needed  to  further  elucidate  the 
relations  between  cross-domain  growth  trajectories. 
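
The idea of correlated growth can be illustrated with a deliberately simplified sketch. The code below is not an LGCM and is not the analysis used in this study: it simulates hypothetical children measured at four waves, fits an ordinary least-squares slope to each child's scores in each domain, and then correlates those slopes across domains. All values (sample size, growth rates, noise levels, and the assumed 0.6 dependence of math growth on EF growth) are illustrative assumptions. A true LGCM would instead estimate intercepts and slopes as latent variables within a single structural equation model.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ols_slope(times, scores):
    """Least-squares slope of one child's scores regressed on measurement time."""
    mt, ms = mean(times), mean(scores)
    num = sum((t - mt) * (s - ms) for t, s in zip(times, scores))
    den = sum((t - mt) ** 2 for t in times)
    return num / den

def observe(rng, rate, waves):
    """Simulate one child's noisy scores: random starting level plus linear growth."""
    start = rng.gauss(10.0, 2.0)
    return [start + rate * t + rng.gauss(0.0, 0.5) for t in waves]

rng = random.Random(42)      # fixed seed so the sketch is reproducible
WAVES = [0, 1, 2, 3]         # four measurement occasions
N_CHILDREN = 2000            # hypothetical sample size

ef_slopes, math_slopes, lit_slopes = [], [], []
for _ in range(N_CHILDREN):
    ef_rate = rng.gauss(1.0, 1.0)                    # assumed EF growth rate
    math_rate = 0.6 * ef_rate + rng.gauss(0.0, 0.8)  # assumed to share variance with EF growth
    lit_rate = rng.gauss(1.0, 1.0)                   # assumed independent of EF growth
    ef_slopes.append(ols_slope(WAVES, observe(rng, ef_rate, WAVES)))
    math_slopes.append(ols_slope(WAVES, observe(rng, math_rate, WAVES)))
    lit_slopes.append(ols_slope(WAVES, observe(rng, lit_rate, WAVES)))

r_ef_math = pearson(ef_slopes, math_slopes)  # expected to be clearly positive
r_ef_lit = pearson(ef_slopes, lit_slopes)    # expected to be near zero
print(f"EF-math slope correlation: {r_ef_math:.2f}")
print(f"EF-literacy slope correlation: {r_ef_lit:.2f}")
```

As in the text above, a positive slope correlation in this sketch says only that growth in the two simulated domains travels together; it cannot by itself show that one domain drives the other.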

Multi-Analytic  Approach 

Previous  work  exploring  longitudinal  relations  between  EF  and 
early  academic  skills  has  typically  taken  a  single-analysis  ap¬ 
proach,  and  this  approach  has  primarily  been  panel  models  (e.g., 
path  analysis).  Although  findings  from  single-analysis  studies  have 
been  useful,  they  provide  limited  information  on  the  development 
of  these  important  skills.  As  Greene  and  colleagues  noted  more 
than  two  decades  ago,  “all  methods  have  inherent  biases  and 
limitations,  so  use  of  only  one  method  to  assess  a  given  phenom¬ 
enon  will  inevitably  yield  biased  and  limited  results”  (Greene, 
Caracelli,  &  Graham,  1989,  p.  256;  see  also  Campbell  &  Fiske,
1959;  Symonds  &  Gorard,  2010).  Thus,  we  took  a  multi-analytic
approach  to  address  our  overarching  research  goal:  examining  the 
longitudinal  associations  between  EF  and  achievement.  We  first 
implemented  a  cross-lagged  panel  model  using  a  latent  EF  factor 
to  determine  how  children's  relative  standing  on  measures  of  EF, 
math,  and  literacy  was  related  over  time.  That  is,  stability  and 
cross-lagged  effects  in  cross-lagged  panel  models  determine  the
stability  of  participants’  relative  standing  on  a  variable  without 
regard  for  whether  the  sample  (or  individual  participants)  actually 
exhibited  gains  in  absolute  magnitude.  High  stability  indicates  that 
participants who scored higher than the sample mean at one time
point tend to score higher than the sample mean at the previous time
point,  regardless  of  whether  that  sample  mean  increased,  de¬ 
creased,  or  remained  the  same  (see  also  Wu,  Selig,  &  Little,  2013). 
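
The distinction between relative standing and absolute gains can be illustrated with a minimal sketch. The scores below are simulated and hypothetical (they are not the study data), and the Pearson correlation between two waves stands in for a stability path. Shifting every child's later score by the same constant changes the sample mean but leaves stability untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for 200 children at two time points (not study data).
time1 = rng.normal(loc=100.0, scale=15.0, size=200)
time2 = 0.8 * (time1 - 100.0) + 100.0 + rng.normal(scale=9.0, size=200)


def stability(x, y):
    """Pearson correlation between two waves (rank-order stability)."""
    return float(np.corrcoef(x, y)[0, 1])


# Three scenarios that differ only in how much the whole sample gained.
no_gain = time2
large_gain = time2 + 20.0
loss = time2 - 20.0

r_none = stability(time1, no_gain)
r_gain = stability(time1, large_gain)
r_loss = stability(time1, loss)

# Stability is the same in all three scenarios; only the mean change differs.
print(round(r_none, 3), round(r_gain, 3), round(r_loss, 3))
print(round(large_gain.mean() - no_gain.mean(), 1))
```

This is why a panel model alone cannot say whether children improved in absolute terms.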

Previous  work  examining  the  bidirectional  associations  between 
EF  and  academic  skills  using  cross-lagged  panel  models  (e.g., 
Fuhs et al., 2014) has relied on factor scores rather than modeling
latent  associations  directly.  The  use  of  factor  scores  as  dependent 
variables is known to produce biased regression slopes and standard
errors (Muthen, 2011; Skrondal & Laake, 2001). The extent
to which previous findings are biased by a reliance on factor scores
therefore  remains  unclear,  and  we  overcome  this  limitation  by 
modeling  EF  directly  as  a  latent  factor. 

We  also  addressed  our  research  goal  using  a  series  of  LGCMs 
that  examined  absolute  changes  (i.e.,  sample-  and  individual-level 
growth)  in  EF,  math,  and  literacy.  By  examining  absolute  changes, 
the  LGCMs  allowed  us  to  paint  a  more  complete  picture  of  how 
EF,  math,  and  literacy  codevelop  by  demonstrating  to  what  extent 
growth  in  one  domain  is  related  to  growth  in  another  domain 
during  the  same  time  frame.  Thus,  the  panel  models  allowed  us  to 
examine  the  bidirectional  relations  between  EF  and  achievement 
and  whether  relative  standing  on  one  domain  predicts  relative 
standing  on  another  domain  at  the  subsequent  time  point,  and  the 
LGCMs  allowed  us  to  examine  changes  in  absolute  magnitude  and 
relations  between  growth  trajectories  across  all  four  time  points 
(Wu  et  al.,  2013). 
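
The idea of correlated growth can be sketched with a simplified two-stage approximation, which is not the LGCM itself: estimate each child's least-squares slope across four waves in two domains, then correlate the slopes. All values, the wave coding (0 to 3), and the variable names are hypothetical, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n_children = 300
waves = np.array([0.0, 1.0, 2.0, 3.0])  # assumed coding of four equally spaced waves

# Hypothetical true slopes for two domains, correlated at about .5.
slope_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
true_slopes = rng.multivariate_normal(mean=[5.0, 4.0], cov=slope_cov, size=n_children)


def simulate(slopes, intercept_mean, noise_sd):
    """Generate child-by-wave scores from linear growth plus noise."""
    intercepts = rng.normal(intercept_mean, 3.0, size=n_children)
    noise = rng.normal(0.0, noise_sd, size=(n_children, waves.size))
    return intercepts[:, None] + slopes[:, None] * waves[None, :] + noise


ef_scores = simulate(true_slopes[:, 0], 20.0, 1.0)
math_scores = simulate(true_slopes[:, 1], 400.0, 1.0)


def ols_slopes(scores):
    """Per-child least-squares slope of score on wave."""
    centered = waves - waves.mean()
    return scores @ centered / (centered @ centered)


ef_slope = ols_slopes(ef_scores)
math_slope = ols_slopes(math_scores)

# Correlation between growth rates in the two domains.
slope_r = float(np.corrcoef(ef_slope, math_slope)[0, 1])
print(round(slope_r, 2))
```

Unlike this sketch, an LGCM estimates the slope covariance at the latent level, so measurement error in the individual slopes does not attenuate it.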

The  Present  Study 

The  goal  of  the  present  study  was  to  clarify  and  expand  upon 
prior  work  (e.g.,  Fuhs  et  al.,  2014;  McClelland  et  al.,  2007)  that  has 
examined  the  longitudinal  relations  between  EF,  math,  and  literacy 
across  the  transition  to  kindergarten  (preschool-kindergarten). 
More  specifically,  we  aimed  to  paint  a  broader  picture  of  how  EF, 
math  and  literacy  are  associated  over  time.  Based  on  recent  theo¬ 
retical  and  empirical  evidence  indicating  that  EF  and  math  may  be 
tightly  coupled  constructs  and  reciprocally  related  (Fuhs  et  al., 
2014;  McClelland  et  al.,  2015),  we  hypothesized  that  EF  would 
significantly  predict  math  in  preschool  and  kindergarten,  and  also 
that math would predict EF during this time frame. Further, we
expected  that  EF  and  math  growth  trajectories  would  be  correlated, 
although  previous  research  on  associations  between  intraindividual 
change  between  the  two  domains  is  mixed  (McClelland  et  al., 
2014;  Willoughby,  Kupersmidt,  &  Voegler-Lee,  2012).  Previous 
research  has  shown  inconsistent  links  between  EF  and  literacy 
(Blair  &  Razza,  2007;  Cameron  Ponitz  et  al.,  2009;  Schmitt  et  al., 
2014), nonsignificant bidirectional associations (Fuhs et al., 2014),
and  nonsignificant  associations  for  intraindividual  change  models 
(McClelland  et  al.,  2014).  We  therefore  did  not  expect  that  this 
same  reciprocal  association  would  emerge  for  EF  and  literacy,  nor 
did  we  expect  the  EF  and  literacy  growth  trajectories  to  be  corre¬ 
lated.  We  also  hypothesized  a  bidirectional  relation  as  well  as 
correlated  growth  between  math  and  literacy  due  to  the  noted 
strong relation between early math and literacy skills over the
preschool and kindergarten years (Duncan et al., 2007; LeFevre et
al.,  2010;  Purpura  et  al.,  2011). 

Findings  from  this  study  will  contribute  to  the  existing  literature 
in  multiple  ways.  In  contrast  to  other  studies,  we  have  four  data 
points  (fall  and  spring  of  preschool  and  kindergarten),  which  will 
allow  us  to  explore  changes  in  the  relations  as  well  as  growth 
trajectories  between  these  skills  during  the  school  year  and  at 
critical  junctures  throughout  the  transition  from  preschool  to  kin¬ 
dergarten  at  a  more  fine-grained  level.  This  could  have  important 
practical  implications  for  children  as  they  enter  and  progress 
through  kindergarten.  Other  studies  examining  bidirectional  asso¬ 
ciations  between  EF  and  early  achievement  (e.g.,  Fuhs  et  al.,  2014) 
were  limited  to  just  one  data  point  in  kindergarten  (end  of  the 
year).  This  additional  time  point  is  important  in  extending  previous 
work  because  it  allows  us  to  better  identify  at  which  point  between 
the  end  of  preschool  and  end  of  kindergarten  the  relations  between 
EF  and  math  may  change.  That  is,  modeling  change  in  relations 
across  four  waves  of  data  collection  will  allow  a  better  understand¬ 
ing  of  whether  change  occurs  primarily  during  the  school  year  (i.e., 
between  Times  1  and  2  and  between  Times  3  and  4),  or  if  the 
change  is  relatively  constant  across  time.  In  addition,  we  modeled 
latent  associations  directly  rather  than  relying  on  factor  scores  that 
may  produce  biased  results.  Finally,  no  studies  to  date  have  ex¬ 
amined  whether  growth  trajectories  of  EF,  math,  and  literacy  are 
correlated  using  LGCM.  Our  multi-analytic  approach  also  allows 
us  to  examine  the  same  overarching  research  question  using  two 
types  of  analyses,  allowing  us  to  better  distill  a  single  story  from 
multiple  models  that  acknowledge  diverse  ways  development  can 
manifest  while  reducing  methodological  biases.  Results  from  the 
present  study  will  further  our  understanding  of  the  complexity  of 
the  relations  between  EF,  math,  and  literacy,  which  could  have 
theoretical  implications  as  well  as  implications  for  the  design  and 
timing  of  instruction  and  intervention  efforts  in  preschool  and 
early  elementary  school. 

Method 

Participants  and  Procedure 

Children  and  families  (N  =  435)  were  recruited  from  38  class¬ 
rooms  in  17  preschools  in  a  small  city  in  the  Pacific  Northwest  to 
participate  in  a  federally  funded  study  focused  on  refining  and 
evaluating  the  Head-Toes-Knees-Shoulders  (HTKS)  task,  a  direct 
assessment  of  EF,  as  a  screening  tool  for  children  ages  4-5.  As 
part  of  this  study,  several  measures  of  EF  as  well  as  a  math  and 
literacy  assessment  were  collected  at  4  data  points  from  2011  to 
2014.  There  was  no  intervention  included  as  part  of  the  larger 
study  that  would  influence  the  interpretation  of  our  results.  To 
recruit schools, the principal investigator contacted preschool directors
by telephone, e-mail, and individual meetings to invite
preschools  to  be  a  part  of  the  study.  Preschools  were  selected  using 
a  convenience  sampling  approach  (i.e.,  preschools  that  were  ac¬ 
cessible  and  willing  to  participate  in  the  study).  Children  were 
excluded if they were younger than 4 years old (n = 5) or older
than 5.5 years old (n = 1) in the fall of preschool. Additionally,
children  were  excluded  if  they  did  not  participate  in  the  study  in 
the  fall  of  preschool  (n  =  5).  The  remaining  424  eligible  children 
were  included  in  the  sample  in  the  current  study. 
Parents signed a written informed consent statement to allow
their  child  to  participate  in  the  study  that  was  approved  by  the 
university  Institutional  Review  Board.  Children  gave  verbal  assent 
prior  to  participating  in  direct  assessments.  After  consenting  to  the 
study,  children  were  assessed  in  two  to  three  sessions  (lasting  10 
to  15  min  each)  during  the  fall  and  spring  of  their  preschool  and 
kindergarten  years  (4  waves  total).  At  each  wave  of  data  collec¬ 
tion,  families  received  a  $20  gift  card  for  their  participation.  In  the 
fall  of  preschool,  55%  of  the  children  were  enrolled  in  Head  Start 
and  15%  were  primarily  Spanish  speakers  (all  Spanish  speakers 
were  enrolled  in  Head  Start).  Teachers  identified  which  children  in 
their  classrooms  were  Spanish-speaking  and  should  receive  the 
assessments  in  Spanish.  We  chose  this  method  for  identifying 
Spanish  speakers  because  teachers  have  the  most  experience  with 
children  in  their  classroom  context  and  to  avoid  overtesting  chil¬ 
dren  by  administering  assessments  in  both  languages.  Parent  de¬ 
mographic  questionnaires  were  collected  during  the  first  wave  of 
the  study  (in  Spanish  when  applicable;  n  =  372,  88%  response 
rate).  The  sample  was  predominantly  reported  as  White  (63%), 
followed  by  Latino/Hispanic  (19%),  multiracial  (13%),  Asian/ 
Pacific  Islander  (3%),  and  other  ethnicities  (2%).  Self-reported 
parent  (87%  maternal)  education  ranged  from  0  to  30  years,  with 
an  average  of  approximately  two  years  in  college  (M  =  14.40, 
SD  =  3.68).  Children  enrolled  in  Head  Start  had  parents  with 
significantly lower reported years of education (M = 11.58, SD =
3.06) than the parents of children not enrolled in Head Start (M =
17.34, SD = 3.14; t(351) = 17.48, p < .001). Among children
enrolled in Head Start, the primarily Spanish-speaking children had
parents with significantly lower reported years of education (M =
9.08, SD = 3.12) than their English-speaking peers (M = 12.59,
SD = 2.38; t(178) = 8.17, p < .001).

Measures 

At  each  wave  of  the  study,  children  were  assessed  on  executive 
function  (EF),  literacy,  and  math  skills.  EF  was  assessed  with  four 
measures:  the  Head-Toes-Knees-Shoulders  (HTKS)  task,  a  Card 
Sort  task,  the  Auditory  Working  Memory  subtest  from  the 
Woodcock-Johnson III Tests of Cognitive Abilities, and the Simon
Says task. Literacy skills were assessed with the Letter-Word
Identification subtest from the Woodcock-Johnson III Tests of
Achievement. Math skills were assessed with the Applied
Problems subtest from the Woodcock-Johnson III Tests of
Achievement.

Head-Toes-Knees-Shoulders.  The  HTKS  was  used  to  as¬ 
sess  children’s  cognitive  flexibility,  working  memory,  and  inhib¬ 
itory  control  through  gross  motor  responses  (McClelland  &  Cam¬ 
eron,  2012;  McClelland  et  al.,  2014).  In  previous  research,  the 
measure  has  been  significantly  related  to  measures  of  cognitive 
flexibility,  working  memory,  and  inhibitory  control  (see  McClel¬ 
land  et  al.,  2014).  There  are  two  parallel  forms  of  the  HTKS,  which 
only  differ  for  part  one  of  the  assessment  (McClelland  et  al.,  2014). 
The  measure  includes  three  sections  of  10  items  each,  with  the  task 
becoming  progressively  harder.  In  part  one,  children  were  in¬ 
structed  to  touch  their  toes  (knees  in  the  parallel  form)  when  told 
to  “touch  your  head  (shoulders  in  the  parallel  form)”  and  vice 
versa.  In  parts  two  and  three,  rules  were  changed  and  added, 
increasing  the  complexity  of  the  task.  Possible  scores  range  from 
0 to 60, with a total of 30 test items receiving scores of 0
(incorrect), 1 (self-correct), or 2 (correct). Previous research indicates
high interrater agreement (κ > .90), and evidence supports
convergent and predictive validity of this measure when assessing
children’s EF in culturally diverse samples and in different languages
(McClelland et al., 2007, 2014; Wanless, McClelland,
Acock, Ponitz, et al., 2011). In the current sample, this measure
demonstrated strong internal consistency across all waves (Cronbach’s
alpha: Wave 1 = .96, Wave 2 = .96, Wave 3 = .96, Wave
4 = .95).
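
The HTKS scoring rule described above can be sketched as a short function. This is an illustrative sketch only: the function name is hypothetical, and it assumes all 30 test items were administered and scored.

```python
def score_htks(item_scores):
    """Sum 30 HTKS item scores: 0 (incorrect), 1 (self-correct), or 2 (correct)."""
    if len(item_scores) != 30:
        raise ValueError("expected 30 test items")
    if any(s not in (0, 1, 2) for s in item_scores):
        raise ValueError("each item must be scored 0, 1, or 2")
    return sum(item_scores)


# Hypothetical child: all correct in part one, self-corrects throughout
# part two, and all incorrect in part three.
example = [2] * 10 + [1] * 10 + [0] * 10
total = score_htks(example)
print(total)  # 30
```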

Card  Sort  task.  Children’s  cognitive  flexibility  was  assessed 
using  a  Card  Sort  task  similar  to  the  traditional  Dimensional 
Change  Card  Sort  measure  (Blackwell,  Cepeda,  &  Munakata, 
2009;  Frye,  Zelazo,  &  Palfai,  1995;  Zelazo,  2006).  Administration 
procedures  were  similar  to  those  described  by  Hongwanishkul, 
Happaney,  Lee,  and  Zelazo  (2005).  The  Card  Sorting  task  con¬ 
sisted  of  up  to  24  items,  with  each  sorting  trial  having  6  items. 
During  this  task,  children  were  asked  to  sort  colored  picture  cards 
of  a  dog,  fish,  or  bird  on  the  basis  of  three  dimensions:  color, 
shape,  and  size.  Four  sorting  boxes  with  target  cards  (either  a  dog, 
fish,  bird,  or  frog)  affixed  on  them  were  placed  directly  in  front  of 
children.  The  frog  target  card  was  meant  to  be  a  distractor,  and 
thus,  there  were  no  picture  cards  with  frogs  on  them.  The  same 
target  and  test  cards  were  used  for  all  participants.  Children  were 
given  one  practice  trial  prior  to  testing  trials.  During  all  test  trials, 
children  were  given  a  test  card  (that  had  the  same  picture  on  it  as 
one  of  the  target  cards)  and  asked  the  question,  “Where  does  this 
one  go?”  and  they  were  to  place  the  card  in  one  of  the  boxes.  No 
feedback  was  given.  For  the  first  six  items  (preswitch  trial), 
children  were  to  sort  on  the  basis  of  shape  (e.g.,  the  dog  cards  go 
in  the  sorting  box  with  the  dog  card  affixed).  For  the  second  six 
items  (postswitch  trial),  children  were  told  they  were  going  to  play 
a  new  game  and  would  now  sort  on  the  basis  of  color.  For  the  third 
six  items  (postswitch  trial),  children  were  told  they  were  going  to 
play  a  new  game  and  would  now  sort  on  the  basis  of  size.  If 
children scored five or more points on the third section, a fourth set
of items was administered, which consisted of a new rule: when
the  card  had  a  black  border  on  it,  children  were  to  sort  on  the  basis 
of  size.  When  the  card  did  not  have  a  black  border,  children  were 
to  sort  on  the  basis  of  color.  All  items  were  weighted  equally 
(including  preswitch  trial  items).  Children  were  given  a  score  of  0 
for  an  incorrect  response  and  1  for  a  correct  response,  with  scores 
ranging  from  0  to  24.  This  assessment  demonstrated  strong  internal 
consistency  in  the  current  sample  across  all  waves  for  all  sections 
(Cronbach’s  alpha:  Wave  1  =  .95,  Wave  2  =  .93,  Wave  3  =  .91, 
Wave  4  =  .88). 
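
A minimal sketch of the Card Sort scoring rule follows. The function and argument names are hypothetical. The sketch assumes each set holds six responses scored 0 or 1, and that the fourth (border) set counts only when the child earned five or more points on the size set.

```python
def score_card_sort(shape, color, size, border=None):
    """Total correct (0-24) across up to four six-item sets of 0/1 responses."""
    sets = [shape, color, size]
    for responses in sets:
        if len(responses) != 6 or any(r not in (0, 1) for r in responses):
            raise ValueError("each set needs six responses scored 0 or 1")
    total = sum(sum(responses) for responses in sets)
    # The border set is administered only after five or more points on size.
    if sum(size) >= 5 and border is not None:
        if len(border) != 6 or any(r not in (0, 1) for r in border):
            raise ValueError("border set needs six responses scored 0 or 1")
        total += sum(border)
    return total


# Hypothetical child who reaches the fourth set and answers every item correctly.
print(score_card_sort([1] * 6, [1] * 6, [1] * 6, [1] * 6))  # 24
```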

Auditory  Working  Memory.  The  Auditory  Working  Mem¬ 
ory  subtest  from  the  Woodcock-Johnson  III  Tests  of  Cognitive 
Abilities  (Woodcock,  McGrew,  &  Mather,  2001b)  or  the  Bateria 
III Woodcock-Munoz (Munoz-Sandoval, Woodcock, McGrew, &
Mather, 2005b) was used to assess children’s working memory.
The task required children to repeat back to the experimenter
things  and  numbers  in  a  specific  order.  That  is,  children  had  to 
hold  information  in  mind  and  then  reproduce  it  in  a  different  order. 
This  standardized  task  demonstrates  strong  internal  consistency 
for  English-speaking  and  Spanish-speaking  preschool  children 
(Woodcock  et  al.,  2001b).  In  the  current  sample,  internal  consis¬ 
tency  was  good  across  all  waves  for  the  full  sample  (Cronbach’s 
alpha:  Wave  1  =  .87,  Wave  2  =  .89,  Wave  3  =  .85,  Wave  4  = 
.82), and for the English-speaking children only (Cronbach’s
alpha: Wave 1 = .89, Wave 2 = .88, Wave 3 = .86, Wave 4 = .81)
and  the  Spanish-speaking  children  only  (Cronbach’s  alpha:  Wave 
1  =  .92,  Wave  2  =  .83,  Wave  3  =  .85,  Wave  4  =  .91). 

Simon  Says  task.  The  Simon  Says  task  was  used  to  assess 
inhibitory  control  (Carlson,  2005;  Strommen,  1973).  The  Simon 
Says  task  has  been  identified  in  previous  research  as  an  advanced 
anti-imitation  task  and  a  measure  of  inhibitory  control  in  that  it 
requires  children  to  inhibit  a  prepotent  response  (i.e.,  do  all  re¬ 
quested  actions)  in  favor  of  a  different  response  (i.e.,  only  do  the 
action  if  experimenter  says  “Simon  Says”;  Carlson,  2005).  Spe¬ 
cifically,  children  were  asked  to  perform  an  action  only  if  the 
experimenter  said,  “Simon  says,”  but  to  remain  still  otherwise.  Of 
the  10  total  trials,  five  trials  required  inhibitory  control.  These 
trials  were  scored  and  children  were  given  a  proportion  score  of 
the  number  correct  (items  requiring  inhibitory  control).  In  previous 
studies,  this  measure  has  been  significantly  correlated  with  other 
assessments  of  inhibitory  control  (McClelland  et  al.,  2014).  Inter¬ 
nal  consistency  for  this  assessment  was  good  across  all  waves 
(Cronbach’s  alpha:  Wave  1  =  .87,  Wave  2  =  .89,  Wave  3  =  .85, 
Wave  4  =  .82). 

Reliability  of  EF.  Using  the  factor  loadings  presented  below 
and discussed later, we computed composite reliability (ω; McDonald,
1970, 1999; Raykov, 1997; Werts, Linn, & Joreskog,
1974) for each EF construct. ω is identical to Cronbach’s (1951)
coefficient α, except that it relaxes the assumption of essential tau
equivalence (i.e., an assumption that all items have equal factor
loadings onto the latent construct).1 Reliability for the EF factors
was weak but acceptable and increased across the four waves of
the present study (ω = .69, .74, .74, .78, for Waves 1 through 4,
respectively). 
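
As an illustration of how ω relates to α, the sketch below computes both from a one-factor model with the factor variance fixed to one, using ω = (Σλ)² / [(Σλ)² + Σθ]. The loadings and residual variances are hypothetical, not the estimates from this study, and the function names are illustrative.

```python
import numpy as np


def composite_reliability(loadings, residual_variances):
    """Omega for a one-factor model with factor variance fixed to 1."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_variances, dtype=float)
    true_var = lam.sum() ** 2
    return float(true_var / (true_var + theta.sum()))


def alpha_from_model(loadings, residual_variances):
    """Cronbach's alpha implied by the same model's covariance matrix."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_variances, dtype=float)
    k = lam.size
    implied_cov = np.outer(lam, lam) + np.diag(theta)
    return float(k / (k - 1) * (1.0 - np.trace(implied_cov) / implied_cov.sum()))


# Hypothetical standardized loadings for four indicators.
equal = np.array([0.6, 0.6, 0.6, 0.6])
unequal = np.array([0.8, 0.7, 0.5, 0.4])

omega_eq = composite_reliability(equal, 1 - equal**2)
alpha_eq = alpha_from_model(equal, 1 - equal**2)
omega_un = composite_reliability(unequal, 1 - unequal**2)
alpha_un = alpha_from_model(unequal, 1 - unequal**2)

print(round(omega_eq, 3), round(alpha_eq, 3))  # equal loadings: the two agree
print(round(omega_un, 3), round(alpha_un, 3))  # unequal loadings: alpha is lower
```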

Measures  of  academic  achievement.  Children’s  literacy  and 
math  skills  were  assessed  using  the  Woodcock  Johnson  Psycho- 
Educational  Battery-III  Tests  of  Achievement  (WJ-III;  Woodcock, 
McGrew, & Mather, 2001a) in English or the Bateria III
Woodcock-Munoz  (Munoz-Sandoval,  Woodcock,  McGrew,  & 
Mather,  2005a)  in  Spanish.  In  a  study  using  a  large  and  diverse 
sample  of  2000  children,  cross-language  equating  procedures  were 
employed  using  item-response  theory  (IRT)  methods.  Results  sug¬ 
gested  that  the  WJ-III  and  the  Woodcock-Munoz  assess  the  same 
competencies  and  can  be  combined  appropriately  for  use  in  cross¬ 
language  studies  (Woodcock  &  Munoz-Sandoval,  1993). 
Woodcock-Johnson  W-scores  were  used  because  they  utilize 
Rasch-based  measurement  models  to  create  equal-interval  scale 
characteristics,  with  the  W-score  centered  at  500  as  the  approxi¬ 
mate  average  performance  of  a  10-year-old  (Mather  &  Woodcock, 
2001). 

Letter-Word  Identification.  Children’s  literacy  skills  were 
measured  using  the  Letter-Word  Identification  subtest  of  the  WJ- 
III  (Woodcock  et  al.,  2001a)  or  the  Bateria  III  Woodcock-Munoz 
(Munoz-Sandoval  et  al.,  2005a).  This  test  measures  letter  identi¬ 
fication  and  word-reading  skills  through  expressive  and  receptive 
items and had strong internal consistency for both the English-speaking
(Cronbach’s alpha: Wave 1 = .92, Wave 2 = .92, Wave
3 = .94, Wave 4 = .94) and Spanish-speaking children (Cronbach’s
alpha: Wave 1 = .83, Wave 2 = .80, Wave 3 = .83, Wave
4 = .90) in the present sample. Although these two subtests have
been deemed comparable in rigorous cross-language validation
studies in terms of content and difficulty (Woodcock et al., 2001b),
they could not be appropriately combined to provide full-sample
reliabilities because children receive different items to ensure
cultural relevance.

Applied  problems.  Children’s  math  skills  were  measured  us¬ 
ing  the  Applied  Problems  subtest  of  the  WJ-III  (Woodcock  et  al., 
2001a)  or  the  Bateria  III  Woodcock-Munoz  (Munoz-Sandoval  et 
al.,  2005a).  This  measure  assesses  children’s  early  mathematical 
operations (e.g., counting, addition, and subtraction) through practical
problems and had good internal consistency for the full
sample (Cronbach’s alpha: Wave 1 = .86, Wave 2 = .87, Wave
3 = .85, Wave 4 = .83), for English-speaking children only
(Cronbach’s alpha: Wave 1 = .80, Wave 2 = .81, Wave 3 = .79,
Wave 4 = .81), and for Spanish-speaking children only (Cronbach’s
alpha: Wave 1 = .86, Wave 2 = .82, Wave 3 = .82, Wave
4 = .80).

Analytic  Approach 

We  examined  longitudinal  relations  between  EF  and  two  aca¬ 
demic  domains:  math  and  literacy.  As  described  above,  we  ex¬ 
plored  these  relations  using  two  separate  sets  of  analyses  (i.e., 
cross-lagged  panel  models  and  LGCM)  to  obtain  a  more  complete 
understanding  of  our  data  than  could  be  provided  by  either  analysis 
alone.  Although  we  had  specific  hypotheses  for  our  research 
questions,  we  did  not  favor  one  analytic  approach  over  the  other 
when  interpreting  the  models  that  were  used  to  answer  our  re¬ 
search  questions.  Instead,  we  chose  two  analytic  models  because 
each  model  provides  a  unique  perspective  on  the  data  at  hand  and 
to  our  overarching  research  question.  Treating  each  model  as 
equally  informative  allows  for  a  fuller  understanding  of  the  devel¬ 
opmental  processes  impacting  EF  and  academic  achievement. 

Participating  children  were  nested  in  classrooms  at  each  wave, 
and  we  computed  ICCs  for  all  target  variables  (i.e.,  EF  indicators 
as  well  as  math  and  literacy  scores).  The  models  for  these  ICCs 
specified  wave-specific  clustering,  such  that  ICCs  for  Wave  1 
variables  used  Wave  1  classrooms  as  the  clustering  units,  ICCs  for 
the  Wave  2  variables  used  Wave  2  classrooms  as  the  clustering 
units,  et  cetera.  We  anticipated  that  between-classroom  differences 
would  be  strongly  related  to  sociodemographic  factors,  and  we 
supplemented  our  examination  of  ICCs  with  computation  of  con¬ 
ditional  ICCs.  To  obtain  conditional  ICCs,  we  first  regressed  all 
EF  and  academic  achievement  variables  on  participant  age  (at 
Time 1), Head Start status, and English language learner (ELL) status in a single-level
regression  model  and  stored  the  residuals  from  these  models  (i.e., 
residual  centering).  We  then  fit  a  saturated  two-level  path  analysis 
(i.e.,  freely  estimating  all  item  variances  and  covariances  at  both 
levels)  for  each  wave  of  data  and  obtained  conditional  ICCs  for  the 
EF  and  academic  achievement  variables. 
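
The logic of comparing unconditional and conditional ICCs can be sketched as follows. Note the swap: this sketch uses a one-way ANOVA estimator of the ICC rather than the saturated two-level model used in the study. The data are simulated, and the classroom counts, effect sizes, and function names are hypothetical.

```python
import numpy as np


def anova_icc(values, clusters):
    """One-way ANOVA estimate of ICC(1), allowing unequal cluster sizes."""
    values = np.asarray(values, dtype=float)
    clusters = np.asarray(clusters)
    labels = np.unique(clusters)
    k, n_total = labels.size, values.size
    sizes = np.array([(clusters == c).sum() for c in labels])
    means = np.array([values[clusters == c].mean() for c in labels])
    ss_between = float((sizes * (means - values.mean()) ** 2).sum())
    ss_within = float(sum(((values[clusters == c] - m) ** 2).sum()
                          for c, m in zip(labels, means)))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    n0 = (n_total - (sizes ** 2).sum() / n_total) / (k - 1)
    return float((ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within))


def residualize(y, covariates):
    """Residuals from a single-level OLS regression of y on the covariates."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


rng = np.random.default_rng(2)
n_class, per_class = 38, 11  # hypothetical classroom layout
classroom = np.repeat(np.arange(n_class), per_class)
# Assumption: Head Start status varies between classrooms, not within them.
head_start = np.repeat(rng.integers(0, 2, n_class), per_class).astype(float)
age = rng.normal(4.7, 0.3, classroom.size)
score = 410 - 25 * head_start + 15 * (age - 4.7) + rng.normal(0, 20, classroom.size)

icc_raw = anova_icc(score, classroom)
icc_cond = anova_icc(residualize(score, np.column_stack([head_start, age])), classroom)
print(round(icc_raw, 2), round(icc_cond, 2))
```

In this simulation the between-classroom variance comes entirely from the covariate, so the conditional ICC falls toward zero once the covariate is partialled out.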

As  Table  1  shows,  all  variables  exhibited  substantial  variability 
at  the  between-classroom  level  (i.e.,  all  ICCs  >  .05),  but  this 
variance  was  largely  accounted  for  by  the  demographic  covariates. 
Only  kindergarten  literacy  retained  a  substantial  amount  of 
between-classroom variance. Appropriately modeling the longitudinal
observations as nested in children and also cross-classified in
wave-specific classrooms would complicate our results and detract
from model interpretability. Given that controlling for the covariates
using single-level regression largely accounted for between-classroom
variation in the measures, we present single-level models
that control for the same covariates included when computing
conditional ICCs. The caveat, therefore, is that the standard errors
for paths involving kindergarten literacy may be slightly biased.

Table 1

ICCs for All Items by Wave-Specific Cluster

                            Unconditional          Conditional
Construct indicator       W1   W2   W3   W4      W1   W2   W3   W4
Executive function
  Working memory         .09  .12  .14  .12     .04  .00  .01  .06
  Simon says             .06  .09  .10  .11     .02  .02  .03  .06
  HTKS                   .12  .19  .13  .10     .06  .03  .02  .02
  Card sort              .11  .11  .13  .07     .03  .00  .01  .02
Literacy                 .21  .21  .17  .12     .05  .02  .14  .15
Math                     .19  .17  .19  .19     .01  .04  .05  .03

Note. Conditional intraclass correlations (ICCs) control for age (at
Time 1), Head Start status, and English language learner (ELL) status.
HTKS = Head-Toes-Knees-Shoulders task.

1 Willoughby, Pek, and Blair (2013) have advocated for the use of
maximal reliability (the reliability of an optimally weighted composite)
when examining latent EF factors. However, recent simulation evidence
has drawn the usefulness of maximal reliability into question (Geldhof,
Preacher, & Zyphur, 2014). We therefore do not include estimates of
maximal reliability in the present study.

Data  screening.  Both  sets  of  analyses  used  robust  maximum 
likelihood  estimation  (MLR  in  Mplus)  to  deal  with  non-normality 
and  missing  data  (Muthen  &  Muthen,  1998-2015).  Skewness 
ranges  for  the  four  EF  tasks  and  achievement  measures 
were  -1.23  to  1.97  at  Wave  1,  -1.20  to  0.83  at  Wave  2,  -1.85 
to  1.02  at  Wave  3,  and  -2.37  to  0.74  at  Wave  4.  Kurtosis  ranges 
were  1.29  to  5.87  at  Wave  1,  1.72  to  6.16  at  Wave  2,  1.44  to  7.07 
at  Wave  3,  and  1.63  to  8.58  at  Wave  4.  For  children  participating 
in  the  study  at  any  given  wave  (i.e.,  missing  data  not  due  to 
children  leaving  the  study  in  between  waves  of  data  collection), 
there  was  less  than  6%  missing  data  on  direct  assessments  and  no 
missing data on age, gender, Head Start status, or language status
(see  Table  2  for  the  number  of  observations  for  every  variable). 
Once  missing  data  due  to  attrition  was  factored  in  (i.e.,  children 
leaving  the  longitudinal  study  and  resulting  in  a  loss  of  data  at  later 
waves),  the  range  of  missing  data  was  0-30.66%  for  individual 
measures  (average  missingness  was  15.34%).  Two  variables  had 
30.66%  missing  data  at  wave  four  (i.e.,  the  Simon  Says  task  and 
Auditory  Working  Memory;  294  observations  out  of  the  original 
sample  size  of  424).  Because  most  missing  data  occurred  due  to 
participant  attrition,  we  created  binary  variables  (0  =  did  not  leave 
study,  1  =  did  leave  study)  to  test  whether  any  of  our  covariates 
were  related  to  attrition  throughout  the  study.  None  of  our  cova¬ 
riates  were  related  to  attrition  that  occurred  within  the  school  years 
(i.e.,  Wave  1  to  Wave  2  and  Wave  3  to  Wave  4).  For  attrition 
between  Waves  2  and  3  (i.e.,  the  transition  from  prekindergarten  to 
kindergarten),  we  found  that  Head  Start  status  ( b  =  0.72,  p  =  .005) 
and  parent  education  ( b  =  -0.08,  p  =  .015)  were  significantly 
related  to  attrition  when  running  bivariate  logistic  regression  mod¬ 
els.  In  other  words,  children  in  Head  Start  and  children  of  parents 
with  fewer  years  of  education  were  more  likely  to  leave  the  study 
between  the  spring  of  prekindergarten  and  fall  of  kindergarten. 
However,  when  both  predictors  were  used  to  predict  attrition, 
neither  was  significant,  suggesting  substantial  shared  variance  in 
their relation to attrition (i.e., reduction in size of coefficient and
increases in standard errors). Thus, all models included Head Start
status  (as  opposed  to  parent  education,  which  had  substantial 
missing  data),  along  with  child  age  and  language  status  as  time- 
invariant  covariates. 
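
The missing-data percentage and the bivariate attrition test can be sketched as follows. The first function reproduces the reported 30.66% from the counts given above (294 of 424). The logistic regression is a generic Newton-Raphson fit on simulated, hypothetical data, not the study's Mplus analysis, and the function names are illustrative.

```python
import numpy as np


def percent_missing(n_observed, n_total):
    """Percentage of the original sample with no observation."""
    return 100.0 * (n_total - n_observed) / n_total


def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


print(round(percent_missing(294, 424), 2))  # 30.66

# Hypothetical attrition data: 1 = left the study, 0 = stayed.
rng = np.random.default_rng(3)
n = 400
head_start = rng.integers(0, 2, n).astype(float)
p_leave = 1.0 / (1.0 + np.exp(-(-1.5 + 0.72 * head_start)))
left = (rng.random(n) < p_leave).astype(float)

b0, b1 = fit_logistic(head_start, left)
print(round(float(b1), 2))  # log odds ratio of leaving for Head Start children
```

With a single binary predictor, the fitted slope equals the log odds ratio from the corresponding 2 x 2 table, which gives a simple check on the fit.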

Cross-lagged  panel  model.  We  used  a  cross-lagged  panel 
model  to  examine  whether  changes  in  relative  standing  on  each 
construct  (EF,  math,  literacy)  were  related  over  time.  The  cross- 
lagged  panel  models  specifically  tested  whether  children  whose 
EF,  math,  and  literacy  scores  were  higher  (or  lower)  than  their 
peers  at  earlier  waves  were  also  higher  (or  lower)  than  their  peers 
at  subsequent  times  of  measurement  (i.e.,  a  test  of  stability).  After 
controlling  for  the  stability  of  relative  standing,  these  analyses  also 
allowed  us  to  test  whether  relative  standing  on  one  variable  at 
earlier  waves  predicted  changes  in  relative  standing  (not  changes 
in  absolute  magnitude)  on  a  different  variable  at  subsequent  waves. 
For  each  cross-lagged  effect  (e.g.,  EF  predicting  changes  in  math), 
we  simultaneously  examined  the  reciprocal  relation  (e.g.,  math 
predicting  changes  in  EF)  as  a  test  of  bidirectionality. 
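
A simplified observed-variable version of these cross-lagged tests is sketched below. It uses ordinary least squares on simulated, z-scored variables for two waves rather than the latent-variable model estimated in the study. The generating coefficients and all names are hypothetical.

```python
import numpy as np


def zscore(v):
    """Standardize a variable to mean 0 and standard deviation 1."""
    return (v - v.mean()) / v.std()


def standardized_paths(outcome, predictors):
    """OLS coefficients for a z-scored outcome on z-scored predictors."""
    X = np.column_stack([zscore(p) for p in predictors])
    beta, *_ = np.linalg.lstsq(X, zscore(outcome), rcond=None)
    return beta


rng = np.random.default_rng(4)
n = 2000
ef1 = rng.normal(size=n)
math1 = 0.5 * ef1 + rng.normal(scale=0.9, size=n)
# Assumed generating model: both constructs are stable, EF predicts later
# math, and math does not predict later EF.
ef2 = 0.6 * ef1 + rng.normal(scale=0.8, size=n)
math2 = 0.5 * math1 + 0.3 * ef1 + rng.normal(scale=0.8, size=n)

# Each outcome is regressed on its own lag (stability) and the other
# construct's lag (cross-lagged effect), which tests bidirectionality.
math_stability, ef_to_math = standardized_paths(math2, [math1, ef1])
ef_stability, math_to_ef = standardized_paths(ef2, [ef1, math1])

print(round(float(ef_to_math), 2), round(float(math_to_ef), 2))
```

Because every variable is standardized within wave, the coefficients describe changes in relative standing only.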

We  first  specified  a  longitudinal  confirmatory  factor  analysis 
(CFA), controlling all indicators for participants’ age at the beginning
of the study, ELL status, and Head Start enrollment status.
This approach to controlling for covariates allows minor differences
between indicators and the covariates to not impact overall
model fit (see also Geldhof, Pornprasertmanit, Schoemann, &
Little,  2013).  The  initial  CFA  allowed  us  to  examine  the  structure 
of  EF  because,  although  there  is  strong  evidence  in  younger 
children  that  EF  is  best  described  as  a  unitary  construct  (Hughes, 
Ensor,  Wilson,  &  Graham,  2009;  Wiebe,  Espy,  &  Charak,  2008), 
there  is  also  evidence  that  it  becomes  more  differentiated  over  time 
(Huizinga, Dolan, & van der Molen, 2006; Lehto, Juujarvi, Kooistra,
& Pulkkinen, 2003). Good fit for a CFA that specified a single
EF  factor  per  time  point  would  support  the  underlying  assumption 
of  our  analyses — that  EF  is  reasonably  unidimensional  as  it  was 
measured  in  this  sample. 

We  scaled  all  latent  variables  in  the  initial  CFA  by  fixing  latent 
means to zero and latent variances to one. We modeled math and
literacy  as  single-indicator  factors  by  freely  estimating  the  factor 
loading  for  each  indicator  onto  its  respective  construct  and  addi¬ 
tionally  fixing  the  indicators’  residual  variances  to  zero.  To  ac¬ 
count  for  correlated  residuals  over  time,  we  estimated  residual 
covariances  within  each  indicator  of  EF  (e.g.,  all  HTKS  indicators 
were  allowed  to  covary,  independent  of  their  relations  implied  by 
the stability of EF as a latent construct). Figure B1 in the Appendix
provides  a  partial  path  diagram  that  illustrates  the  EF  component 
of  this  model. 

We  established  measurement  invariance  of  the  EF  construct 
across waves using the change in comparative fit index (CFI)
criterion suggested by Cheung and Rensvold (2002; CFI decreases by <
.01). Modeling invariance requires equating factor loadings
(weak  invariance)  and  intercepts  (strong  invariance),  allowing  for 
differences  at  the  latent  level  (i.e.,  latent  variances  and  means, 
respectively;  see  Little,  1997  for  a  discussion).  Thus,  latent  vari¬ 
ances  for  EF  in  Times  2  through  4  were  freely  estimated  in  the 
weak  invariance  model,  and  latent  means  for  EF  in  Times  2 
through  4  were  additionally  freed  in  the  strong  invariance  model. 
These  tests  ensured  that  the  qualitative  meaning  of  EF  remained 
stable  across  the  four  waves  of  data  collection  rather  than  EF  being 
strongly  indicated  by  one  measure  in  earlier  waves  and  strongly 
indicated  by  a  different  measure  in  later  waves.  Invariance  could 
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Table  2 


Descriptive  Statistics  for  All  Study  Variables 


Prekindergarten  (Year  1) 

Kindergarten  (Year  2) 

Fall  (Wave  1) 

Spring  (Wave  2) 

Fall  (Wave  3) 

Spring  (Wave  4) 

Variable 

N 

M  ( SD ) 

N 

M  ( SD ) 

N 

M  ( SD ) 

N 

M  (SD) 

Age 

Percent  male 

424 

424 

4.70  (0.30) 

51% 

394 

394 

5.15(0.30) 

51% 

308 

308 

5.67  (0.30) 

50% 

299 

299 

6.17  (0.29) 
51% 

Percent  Head  Start 

424 

55% 

394 

54% 

308 

51% 

299 

51% 

Percent  ELL 

424 

15% 

394 

15% 

308 

15% 

299 

14% 

Parent  education 

353 

14.40  (4.23) 

336 

14.34  (4.26) 

275 

14.60  (4.45) 

269 

14.67(4.46) 

HTKS 

403 

17.41  (17.20) 

391 

25.15  (18.28) 

303 

33.17  (17.74) 

296 

39.22(16.00) 

Card  sort 

409 

13.64  (6.67) 

389 

16.49  (5.92) 

307 

18.60  (4.88) 

295 

19.78  (3.88) 

Working  memory 

400 

450.30  (14.80) 

385 

456.17  (17.97) 

303 

464.60(19.21) 

294 

473.18  (19.90) 

Simon  Says 

408 

0.14(0.28) 

387 

0.29  (0.38) 

307 

0.45  (0.39) 

294 

0.54  (0.38) 

Math 

401 

410.17  (23.30) 

391 

419.83  (23.11) 

305 

431.02  (20.71) 

295 

442.09  (19.29) 

Literacy 

408 

335.65  (26.59) 

390 

349.33  (26.80) 

305 

366.00(29.14) 

295 

400.24  (35.21) 

Note.  ELL  English  language  learner;  HTKS  —  Head-Toes-Knees-Shoulders  task;  Working  memory  =  Auditory  Working  Memory  subtest  from  the 
Woodcock-Johnson  III  Tests  of  Cognitive  Abilities;  Math  =  Applied  Problems  subtest  from  the  Woodcock-Johnson  III  Tests  of  Achievement;  Literacy  = 
Letter-Word  Identification  subtest  from  the  Woodcock-Johnson  III  Tests  of  Achievement. 


not  be  tested  for  math  or  literacy  because  those  factors  had  only 
one  indictor  per  time  point  (e.g.,  equating  factor  loadings  for  math 
over  time  would  result  in  three  additional  degrees  of  freedom  that 
would  then  be  lost  by  freely  estimating  the  latent  variances  for 
math  at  Times  2  through  4,  resulting  in  no  change  in  model  fit). 

After establishing measurement invariance, we specified a
longitudinal structural equation model (SEM) that included single-lag
stability regressions (e.g., EF at Time 1 predicting EF at Time 2)
and single-lag cross-construct regressions (e.g., EF at Time 1
predicting math at Time 2). We freely estimated all within-wave
covariances (e.g., EF at Time 1 covaried with math and literacy at
Time 1). The cross-lagged panel model assumes no longitudinal
covariances except those specified by the longitudinal regression
coefficients, and we tested this assumption by first estimating all
covariances between constructs separated by more than one lag
(e.g., EF at Time 1 covaried with math at Times 3 and 4). The
latent variable covariance matrix was therefore completely
saturated, and our initial SEM model had identical fit to our
strong-invariance CFA model. We then tested the assumption of no
longitudinal covariance by removing all covariances between
constructs separated by more than one lag and performing a likelihood
ratio test.
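
As a bookkeeping check on this comparison, the sketch below counts the covariances that the restricted model removes. It assumes only what the text states: three constructs (EF, math, literacy), each represented once per wave across four waves. The variable names are illustrative and are not taken from the authors' analysis scripts.

```python
from itertools import combinations

constructs = ["EF", "math", "literacy"]
waves = [1, 2, 3, 4]

# One latent variable per construct per wave (12 in total).
latents = [(c, w) for w in waves for c in constructs]

# Covariances between variables separated by more than one lag are the
# ones the cross-lagged panel model assumes to be zero.
long_lag_pairs = [
    (a, b) for a, b in combinations(latents, 2) if abs(a[1] - b[1]) > 1
]

n_removed = len(long_lag_pairs)
print(n_removed)
```

Under these assumptions the count equals the 27 degrees of freedom reported for the likelihood ratio test in the Results.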

LGCM. To examine whether rates of change in EF, math, and
literacy were correlated in our data, we estimated the associations
between the growth parameters for each construct in a
three-trajectory LGCM. Based on the assumption that growth in the
target variables, especially EF (Zelazo et al., 2013), may be
nonlinear, we specified quadratic growth curves for all target
constructs. The model then examined how initial standing (i.e., the
random intercepts) and the rates of change and acceleration (i.e.,
the random linear and quadratic slopes) were correlated. The
LGCM treated EF as a latent factor, meaning the growth model for
that construct was technically a curve-of-factors model (McArdle,
1988; see also Hancock, Kuo, & Lawrence, 2001). We imposed
the same invariance constraints from the panel model on the EF
factor in the growth model, although factors in the growth model
were identified by constraining the factor loading of HTKS to one
and fixing the intercepts for all HTKS indicators to zero. Latent
intercepts for EF were also fixed to zero to identify the growth
component of the model (see also Figure 1 in Hancock et al.,
2001). Due to model complexity, and to acknowledge that the
covariates were between-persons variables, we controlled for all
covariates at the level of the growth parameters. Figure B2 in the
Appendix provides a partial path diagram of the EF component of
this model.

For  the  sake  of  comparability  to  our  panel  models,  we  used 
wave  in  the  study  as  loadings  for  these  models  (i.e.,  loadings  for 
the  linear  slope  were  0,  1,  2,  and  3,  for  Waves  1,  2,  3,  and  4, 
respectively).  This  approach  allowed  us  to  model  each  wave  of 
data  as  a  discrete  time  point  rather  than  taking  the  more  traditional 
approach  of  modeling  each  observation  of  each  child  as  occurring 
at  the  child’s  unique  age  at  the  assessment.  Given  participants’ 
relatively  narrow  age  range,  very  few  children  in  later  waves  were 
younger  than  children  measured  in  earlier  waves.  That  is,  child 
ages  did  not  substantially  overlap  across  waves. 
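
To make the time coding concrete, the following sketch computes model-implied average trajectories from the linear loadings of 0, 1, 2, and 3. Two things here are assumptions rather than statements from the text: the quadratic loadings are taken to be the squared linear loadings (0, 1, 4, 9), which is the usual convention, and the growth-parameter values are the rescaled conditional means reported in Table 6. Function and variable names are hypothetical.

```python
# Linear loadings as stated in the text; quadratic loadings are ASSUMED
# to be the squares of the linear loadings.
linear = [0, 1, 2, 3]
quadratic = [t ** 2 for t in linear]

# Conditional means (intercept, linear, quadratic) from Table 6, in the
# rescaled metric used for estimation, not the raw score metric.
growth_means = {
    "EF": (1.299, 0.91, -0.13),
    "literacy": (34.86, 0.60, 0.48),
    "math": (41.72, 0.97, 0.01),
}


def implied_means(intercept, slope, accel):
    """Model-implied mean at each wave: intercept + slope*t + accel*t^2."""
    return [
        intercept + slope * t + accel * t2
        for t, t2 in zip(linear, quadratic)
    ]


trajectories = {name: implied_means(*p) for name, p in growth_means.items()}
for name, values in trajectories.items():
    print(name, [round(v, 2) for v in values])
```

Because the loadings index waves rather than ages, each step in the trajectory corresponds to one measurement occasion, not to a fixed amount of elapsed time.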

Results 

Descriptive  statistics  are  presented  in  Table  2.  Overall,  and  as 
expected,  children  improved  at  each  wave  of  the  study  on  EF  tasks, 
math,  and  literacy. 

Panel  Models 

The initial CFA fit the data well (fit for all panel models is
presented in Table 3) and had statistically significant factor loadings
for all indicators of EF (all ps < .001). Modification indices did not
indicate areas of localized misfit. An initial test of weak (i.e., loading)
invariance substantially decreased model fit (ΔCFI = -.02), with
modification indices suggesting that the relation between EF and
the Card Sort total score changed across waves and that the
relation between Working Memory and EF was significantly
different at Wave 4 than in the other waves. Freely estimating the
Card Sort factor loading for Waves 1 and 2 and the Working
Memory factor loading in Wave 4 resulted in a model that
supported partial weak invariance (ΔCFI = -.005; Δ Bayesian
information criterion [BIC] = -3.65). Equating the intercepts
across waves in this model further supported partial strong
factorial invariance (ΔCFI = -.003; ΔBIC = -28.75). Table 4
presents factor loadings and intercepts from the strong invariance
model and highlights which parameters were freely estimated
versus equated across time. This model shows that HTKS was a
relatively strong indicator of EF across waves, Working Memory
was a stronger indicator of EF in Wave 4 (relative to other waves),
and the Card Sort was an especially strong indicator of EF at Wave
1. Latent variances and correlations from this model are presented
in Table 5, with actual (rather than latent) means and variances
provided for the math and literacy scores.

Table 3
Fit for Panel Models

Panel models                         χ²ᵃ      df    RMSEA   90% CI (RMSEA)   CFI    TLI   BIC
Initial                              284.72   170   .04     [.03, .05]       .979   .96   30786.98
Weak invariance                      387.09   179   .05     [.05, .06]       .962   .93   30831.28
Partial weak invariance              319.45   176   .04     [.04, .05]       .974   .95   30783.33
Strong invariance and initial SEMᵇ   343.63   185   .05     [.04, .05]       .971   .95   30754.58

Note. RMSEA = root-mean-square error of approximation; CI = confidence interval; CFI = comparative fit
index; TLI = Tucker-Lewis Index; SEM = structural equation model; BIC = Bayesian information criterion.
ᵃ Models were estimated using robust maximum likelihood; chi-squared statistics cannot be directly
compared. ᵇ Fit for these models was identical because the latent covariance structure of the initial SEM was
saturated.
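
The change-in-fit comparisons reported above can be reproduced from the CFI and BIC values in Table 3. The sketch below only illustrates the Cheung and Rensvold (2002) decision rule as described in the Method section (a decrease in CFI smaller than .01 is taken as support for invariance). The dictionary keys and the `compare` helper are hypothetical names introduced for this example.

```python
# CFI and BIC values as reported in Table 3.
fit = {
    "initial": {"cfi": 0.979, "bic": 30786.98},
    "weak": {"cfi": 0.962, "bic": 30831.28},
    "partial_weak": {"cfi": 0.974, "bic": 30783.33},
    "strong": {"cfi": 0.971, "bic": 30754.58},
}


def compare(constrained, baseline, threshold=0.01):
    """Return (delta_cfi, delta_bic, supported) for a nested comparison."""
    d_cfi = fit[constrained]["cfi"] - fit[baseline]["cfi"]
    d_bic = fit[constrained]["bic"] - fit[baseline]["bic"]
    # Invariance is supported when CFI drops by less than the threshold.
    return round(d_cfi, 3), round(d_bic, 2), d_cfi > -threshold


for constrained, baseline in [
    ("weak", "initial"),
    ("partial_weak", "initial"),
    ("strong", "partial_weak"),
]:
    print(constrained, "vs", baseline, compare(constrained, baseline))
```

Only the full weak invariance model fails the criterion, which is why the partially invariant model serves as the baseline for the strong invariance test.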


Table 4
Factor Loadings and Intercepts from the Strong Invariance CFA Model

Construct indicator   Standardized loading (SE)ᵃ   Raw-metric loading (SE)ᵃ   Raw-metric intercepts (SE)ᵃ
Time 1 EF
  Working memory      .34* (.04)                   .51 (.06)                  45.37 (.10)
  Simon Says          .45* (.04)                   .67 (.07)                  .90 (.10)
  HTKS                .45* (.06)                   .49 (.06)                  1.36 (.08)
  Card sort           .59* (.04)                   1.30 (.09)                 5.04 (.15)
Time 2 EF
  Working memory      .40* (.03)                   equated (T1)               equated (T1)
  Simon Says          .49* (.03)                   equated (T1)               equated (T1)
  HTKS                .58* (.04)                   equated (T1)               equated (T1)
  Card sort           .55* (.05)                   .76 (.09)                  equated (T1)
Time 3 EF
  Working memory      .41* (.03)                   equated (T1)               equated (T1)
  Simon Says          .52* (.04)                   equated (T1)               equated (T1)
  HTKS                .63* (.04)                   equated (T1)               equated (T1)
  Card sort           .52* (.05)                   .54 (.07)                  equated (T1)
Time 4 EF
  Working memory      .54* (.04)                   .72 (.09)                  equated (T1)
  Simon Says          .54* (.04)                   equated (T1)               equated (T1)
  HTKS                .66* (.04)                   equated (T1)               equated (T1)
  Card sort           .59* (.04)                   equated (T3)               equated (T1)

Note. Indicators were divided by constants to make their variances more
homogenous, expediting model convergence (e.g., Muthén, 2010). For
more information, see Appendix A. HTKS = Head-Toes-Knees-Shoulders
task; EF = executive function; CFA = confirmatory factor analysis.
ᵃ Loadings are somewhat attenuated because covariates were controlled at
the item level.
*p < .001.


We next specified a cross-lagged panel SEM and tested the
assumption of no longitudinal covariances above and beyond those
specified by the lag-1 structural regressions. A likelihood ratio test
comparing models that did versus did not allow longitudinal
covariances between EF, math, and literacy measures separated by
two or more lags (e.g., EF at Time 1 and math at Time 3) supported
this assumption, Δχ²(27) = 30.70, p = .28; ΔBIC = -130.28.
The structural component of this final model is illustrated in
Figure 1. This path diagram omits nonsignificant regression
estimates and within-wave covariances. These additional details are
provided in Figure B3 and Table B1 of the Appendix.
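
The p value attached to a chi-square difference can be checked directly from the statistic and its degrees of freedom. The sketch below uses only the Python standard library and assumes that the reported difference statistics can be referred to an ordinary chi-square distribution (the models were estimated with robust maximum likelihood, so the reported values are presumably already scaled difference statistics). `chi2_sf` is a helper written for this illustration, not part of any package used by the authors.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df.

    Uses the recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1)
    for the regularized upper incomplete gamma function, with h = x / 2.
    """
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)             # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))  # Q(1/2, h)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Panel model: long-lag covariances removed, reported with 27 df.
p_panel = chi2_sf(30.70, 27)
# LGCM constraints, reported in the LGCM results with 23 df.
p_lgcm = chi2_sf(29.68, 23)
print(round(p_panel, 2), round(p_lgcm, 2))
```

A nonsignificant result in either comparison means the more constrained model is retained, which is the pattern reported for both tests.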

Results suggest that (a) relative standing on all variables was
stable (i.e., all autoregressive paths were statistically significant at
p < .001), with EF displaying especially high stability (βs ranged
from .75 to .86); (b) that changes in relative standing on EF and
literacy were essentially unrelated across waves (i.e., low
cross-lagged regression coefficients); (c) that EF and math were
mutually influential in preschool and this relation shifted in
kindergarten, such that only EF predicted math; and (d) that math and
literacy were not consistently related across time.

LGCM 

The initial three-trajectory model indicated a
non-positive-definite latent covariance matrix caused by collinearity between
the EF intercept and quadratic slope and by nonsignificant residual
variances for the linear slope for literacy and the quadratic slope
for math. These nonsignificant residual variances suggest an
overfitted model. We therefore eliminated the collinearity by
regressing the quadratic slope for EF on the intercept for EF and
constraining the residual variance of the quadratic slope to zero. We
also fixed the nonsignificant residual variances to zero. These
constraints did not significantly reduce model fit, Δχ²(23) =
29.68, p = .16; ΔBIC = -107.65, and the resulting model fit the
data well, χ²(277) = 574.57, p < .001; RMSEA = .05 [.05, .06];
CFI = .95; TLI = .93. An examination of the modification indices
did not reveal areas of extreme local misfit. Table 6 contains the
estimated means and variances for the latent growth parameters
from this model and clarifies which growth parameters were
estimated in which ways (i.e., fixed vs. random variances). All
growth parameters with freely estimated variances were regressed
on the covariates and allowed to covary among themselves. Partial
correlations among these parameters are presented in Table 7.
Fixed growth parameters were regressed on the control variables
(i.e., age, Head Start status, and ELL status) but did not have freely
estimated variances and did not covary with any other growth
parameter. The correlations in Table 7 highlight strong
associations among the intercepts and between the intercept and slope
parameters. Average growth trajectories are plotted in Figure 2.

Table 5
Means, Variances, and Correlations for Strong Invariance Model

Constructs     1     2     3     4     5     6     7     8     9     10    11    12    M
1. EF1                                                                                 .00
2. Math1       .65*  4.03                                                              41.71
3. Literacy1   .44*  .39*  5.45                                                        34.77
4. EF2         .91*  .71*  .43*  1.97                                                  1.?0
5. Math2       .68*  .74*  .40*  .74*  3.72                                            42.71
6. Literacy2   .36*  .41*  .77*  .40*  .41*  5.66                                      36.16
7. EF3         .85*  .70*  .40*  .92*  .78*  .34*  2.36                                2.63
8. Math3       .59*  .67*  .41*  .66*  .79*  .40*  .77*  3.14                          43.67
9. Literacy3   .36*  .37*  .70*  .40*  .35*  .80*  .36*  .41*  7.02                    37.60
10. EF4        .71*  .64*  .32*  .83*  .69*  .32*  .92*  .71*  .34*  2.18              3.24
11. Math4      .60*  .62*  .41*  .67*  .70*  .41*  .71*  .75*  .41*  .69*  2.79        44.74
12. Literacy4  .43*  .42*  .60*  .46*  .45*  .65*  .43*  .50*  .78*  .43*  .48*  10.69 41.02

Note. Variances on diagonal, correlations below diagonal. Actual (as opposed to latent) means and variances provided for Math and Literacy. Indicators
were divided by constants to make their variances more homogenous, thus expediting model convergence (e.g., Muthén, 2010). For more information, see
Appendix A. EF = executive function.
*p < .001.

To  better  understand  how  the  constructs  at  Wave  1  (estimated  as 
the  intercepts)  may  have  impacted  the  results  of  the  LGCMs,  we 
took  the  additional  step  of  regressing  all  three  random  slopes  on  all 
three  random  intercepts.  This  final  model  allowed  us  to  examine 
how absolute changes in each variable were correlated after controlling
for initial standing on each (i.e., all three random intercepts).
As shown in Table 8, the residual random slope for math
was  significantly  correlated  with  both  other  slopes,  although  the 
correlation  between  the  EF  and  literacy  slopes  was  not  statistically 
significant.  Thus,  after  children’s  initial  standing  was  accounted 
for  in  the  LGCM,  the  results  suggested  growth  in  EF  and  math 
were  associated  during  this  developmental  period. 


Discussion 

The overarching aim of the current study was to examine the
longitudinal relations between EF, math, and literacy across four
waves of measurement spanning preschool and kindergarten. We
employed a multi-analytic approach, first using a cross-lagged
panel model to test the extent to which relative standing on EF,
math, and literacy were related across time. We then used LGCMs
to test whether growth in our constructs was associated. As
expected, results generally demonstrated significant reciprocal
relations and correlated growth between EF and math as well as
math and literacy, but not between EF and literacy. Notably,
results from our panel models indicated that these significant
relations may change over time. For example, EF predicted math
but math did not predict EF during the kindergarten school year.
These findings contribute to the current literature by demonstrating
a bidirectional association and correlated growth between EF, a
more domain-general set of cognitive processes, and math, a
domain-specific skill. These results have implications for research
on curriculum development and intervention design. Further, this
study adds to the theoretical discourse surrounding the
development of EF and academic skills in early childhood.

Figure 1. Path diagram for our final structural model (standardized coefficients). Within-wave covariances and
nonsignificant regression paths not shown. Time 1 = fall of preschool; Time 2 = spring of preschool; Time 3 =
fall of kindergarten; Time 4 = spring of kindergarten.

Table 6
Estimated Growth Parameter Conditional Means and Variances

Parameter              M (SE)            Variance (SE)
Executive functionᵃ
  Intercept            1.299 (.07)***    .28 (.05)***
  Linear               .91 (.07)***      .05 (.02)*
  Quadratic            -.13 (.02)***     .00 (FIXED)
Literacy
  Intercept            34.86 (.18)***    4.26 (.49)***
  Linear               .60 (.17)***      .00 (FIXED)
  Quadratic            .48 (.05)***      .05 (.01)***
Math
  Intercept            41.72 (.14)***    3.18 (.51)***
  Linear               .97 (.12)***      .10 (.03)**
  Quadratic            .01 (.04)         .00 (FIXED)

Note. Indicators were divided by constant values to create more
homogenous indicator variances. Values in this table therefore provide
meaningful information about the shape of each growth trajectory but do not
describe scores in their raw metric.
ᵃ Calculated as the estimated intercept (.02) plus the conditional mean of
the executive function (EF) intercept (1.299) multiplied by the regression
coefficient regressing the EF quadratic slope on the EF intercept (-.12).
*p < .05. **p < .01. ***p < .001.
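
The calculation described in footnote a of Table 6 can be written out as a worked equation. The symbols are introduced here for illustration only: $\alpha_Q$ is the estimated intercept of the regression of the EF quadratic slope on the EF intercept factor, $b$ is that regression coefficient, and $\mu_I$ is the conditional mean of the EF intercept. It is an assumption of this sketch that the footnote refers to the mean of the EF quadratic slope.

```latex
\mu_Q \;=\; \alpha_Q + b\,\mu_I \;\approx\; .02 + (-.12)(1.299) \;\approx\; -.14
```

Because the three inputs are themselves rounded, this agrees, within rounding, with the value of -.13 reported for the EF quadratic mean.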

Bidirectional  Relations  Between  EF,  Math,  and 
Literacy:  Cross-Lagged  Panel  Models 

Consistent  with  previous  research  (Blair  &  Razza,  2007;  Bull  et 
al.,  2008;  Bull,  Johnston,  &  Roy,  1999;  McClelland  et  al.,  2007), 
our  panel  models  suggested  that  EF  is  a  significant  predictor  of 
math in preschool and kindergarten. These findings provide support
for the notion that EF may be foundational for the development
of important early math skills. In addition, and also consistent
with a recent study (Fuhs et al., 2014), these results
demonstrated  reciprocal  associations  between  EF  and  math  during 
preschool  and  as  children  transition  into  kindergarten  (i.e.,  from 
the  spring  of  preschool  to  the  fall  of  kindergarten).  These  findings 
suggest  that  EF  may  not  only  be  important  for  the  development  of 
math,  but  that  math  may  also  be  important  for  the  development  of 
EF  during  this  time.  Thus,  it  is  possible  that  math  skills  are 
foundational  for  growth  in  EF.  Essentially,  the  ability  to  pay 
attention,  remember  complex  rules,  and  persist  on  challenging 
tasks  may  help  children  perform  better  on  math  tasks  (McClelland 
et  al.,  2007)  and,  conversely,  strong  math  skills  (e.g.,  solving 
complicated  math  problems)  may  contribute  to  children’s  ability  to 
sustain  attention,  remember  a  series  of  rules,  and  inhibit  incorrect 
responses  on  complex  EF  tasks  (Fuhs  et  al.,  2014). 

With the addition of a fourth time point at the beginning of
kindergarten (in comparison to prior research), we were able to
extend the existing literature and identify at which point during
preschool and kindergarten relations between EF and math may
change. Findings revealed that although math and EF were
reciprocally related during preschool and during the transition to
kindergarten, this bidirectional relation faded during the kindergarten
year. Specifically, during the kindergarten year (between Waves 3
and 4), EF in the fall remained a significant predictor of math in
the spring, but not vice versa. Changes in the relations between EF
and math may be due to factors associated with preschool and/or
kindergarten instruction. In kindergarten, children are charged
with more challenging math tasks and they may need to call upon
EF skills to resist the natural inclination to either give up and
abandon a task or use a less efficient previously learned rule (Bull
et al., 1999). In contrast, mathematics instruction in preschool is
often limited in complexity and focused around a narrow range of
activities (e.g., counting; Ginsburg, Lee, & Boyd, 2008). It may be
the case that, in preschool, where limited mathematics instruction
is provided, children who have higher levels of math skills are
engaged in instructional activities that provide the opportunity for
them to develop higher EF skills and, in turn, those children with
higher levels of EF are better able to acquire the limited
mathematics instructional information that is provided. These
instructional differences may explain why the bidirectional relationship
(EF ↔ math) emerges during preschool and fades during
kindergarten, when children begin experiencing more uniform and
frequent math instruction.

Taken  together,  these  findings  have  potential  implications  for 
the  development  and  evaluation  of  instructional  strategies  and 
interventions  that  are  designed  to  improve  either  EF  or  math.  In 
preschool, it may be more beneficial for children if teachers target
both EF and math simultaneously, whereas in kindergarten,
focusing instructional efforts on EF as a foundational skill set may be
more important. Additionally, these findings suggest that children
who enter kindergarten with low levels of EF may be at risk for
academic difficulties and in need of extra instructional supports or
intervention. Critically, the causal effects of such instructional
strategies need to be evaluated experimentally.

In contrast to math, the panel model indicated that relative
standing on EF and literacy were essentially unrelated across
waves. These findings are not surprising given inconsistent links
between EF and literacy in previous studies (Blair & Razza, 2007;
Blair et al., 2015; Cameron Ponitz et al., 2009; Schmitt et al.,
2014) and nonsignificant bidirectional associations in recent work
(Fuhs et al., 2014). Several speculations as to why associations are
stronger for EF and math than EF and literacy have been
introduced in recent literature. For example, some argue that math
content and activity place more cognitive demands on children
and, thus, require stronger EF skills to master (Bull et al., 2008;
Clark et al., 2010; Espy et al., 2004; Willoughby, Blair, et al.,
2012). A second explanation is that EF is a foundational skill set
that supports growth in reasoning abilities (Richland & Burchinal,
2013). Higher-order reasoning skills are necessary to succeed on
math tasks that require children to solve complex story or word
problems (e.g., "Katie had three balls. One of them rolled away.
Now how many does she have?"; Blair et al., 2015). In contrast,
literacy tasks typically assess children's knowledge, making fewer
demands on reasoning abilities and EF. Others argue that
differences in academic focus in early childhood classrooms could play
a role in explaining differences in the development of math versus
literacy (Cameron Ponitz et al., 2009). Extant research suggests
that preschool teachers spend more time engaged in literacy
instruction compared to math instruction (Layzer, Goodsen, & Moss,
1993; Skibbe, Hindman, Connor, Housey, & Morrison, 2013).
Children may, therefore, have to engage in math activities (e.g.,
patterning during free play) spontaneously and independently
during the school day, which may require higher levels of EF.
Similarly, parents report engaging in significantly more literacy
activities at home than math activities (Cannon & Ginsburg, 2008;
Skwarchuk, Sowinski, & LeFevre, 2014). Parents who believe
their children are more academically ready may engage their
children in more cognitively demanding math activities at home
(DeFlorio & Beliakoff, 2015). The greater consistency of literacy
activities at home and school may contribute to its overall
distinction from growth in EF.

Table 7
Partial Correlations Between Growth Parameters

Parameters                1         2         3         4         5         6
1. EF – Intercept         –
2. EF – Linear                      –
3. Literacy – Intercept   .48***    .38**     –
4. Literacy – Quadratic   .20**     .31**     .08       –
5. Math – Intercept       .81***    .82***    .49***    .25*      –
6. Math – Linear          -.40**    -.17      -.14      .12       -.53***   –

Note. The raw-metric regression of the executive function (EF) quadratic
slope on the EF intercept was -.12 (p < .01). A parallel model that relied
on numerical integration provided a standardized coefficient of -.91.
*p < .05. **p < .01. ***p < .001.

Figure 2. Average growth trajectories from the latent growth curve modeling (LGCM). Math and literacy
scores were rescaled from original values. (Panel titles recovered from the figure: Executive Functioning; Reading.)

Table 8
Correlations Between Slopes, Conditional on Intercepts and Covariates

Parameter                 1         2         3
1. EF – Linear            –
2. Literacy – Quadratic   .21       –
3. Math – Linear          .63**     .32**     –

**p < .01.

Another aspect of our research question was to investigate
bidirectional relations between math and literacy skills across the
preschool and kindergarten years. Somewhat contrary to our
expectations, these relations were weaker in preschool and became
bidirectional during the kindergarten year. These departures from
our expectations are also likely due to instructional practices.
In contrast to the divergence of the relation between
math and EF, there may be a convergence in the relation
between math and literacy as instruction in both domains becomes
more parallel in quantity. In preschool, children are generally
exposed to more literacy instruction compared to math instruction.
In contrast, in kindergarten, math and literacy instruction become
more uniform and consistent, and all children are typically exposed
to the same quantity of instruction for both academic domains.
This parallel exposure likely allows children to draw on concepts
learned from the instruction in the other domain (e.g., being able
to read a word problem allows children to complete the math task)
and thus, the relation between math and literacy may be
strengthened.

Correlated  Growth  Between  EF,  Math, 
and  Literacy:  LGCMs 

To further investigate the longitudinal associations between EF,
math, and literacy, we employed a second analytic approach:
LGCM. Consistent with prior evidence (e.g., McClelland et al.,
2007), these models indicated that the latent intercepts (a proxy for
where children started) for EF, math, and literacy were all
significantly correlated, suggesting that performance relative to peers
was consistent across measures. Also, consistent with prior
research (McClelland et al., 2007; Schmitt et al., 2014), initial levels
of EF and math were more highly correlated than EF and literacy.
However, whether the coupling of the three variables is a result of
unidirectional causality, bidirectional causality, or the result of
unmeasured third variables is not clear.

In  terms  of  cross-domain  relations  in  growth,  the  final  LGCM 
indicated that, after controlling for initial standing on each construct
(i.e., all latent intercepts), the latent EF and math slopes were
positively  correlated.  In  contrast,  results  revealed  a  nonsignificant 
relation  between  growth  in  EF  and  literacy.  This  finding  is  in  line 
with  prior  studies  demonstrating  that  the  longitudinal  association 
between  EF  and  math  is  more  robust  than  EF  and  literacy  (Blair  et 
al.,  2015;  Cameron  Ponitz  et  al.,  2009).  This  finding  also  supports 
our  earlier  assertion  that  engaging  in  math  activities  may  be  a 
context  in  which  children  are  able  to  expand  their  EF  and  that 
domain-specific  differences  in  instruction  during  the  preschool  and 
kindergarten  years  may  account  for  these  differential  patterns  of 
growth.  For  example,  over  the  last  two  decades,  there  has  been  a 
strong  emphasis  on  early  literacy  instruction  in  both  preschool  and 
kindergarten. Indeed, previous research indicates a strong schooling
effect for children's literacy development (Burrage et al., 2008;
Christian,  Bachman,  &  Morrison,  2001).  Due  to  this  emphasis  on 
literacy  instruction,  children  may  not  need  to  call  upon  their  EF  as 
much  when  engaging  in  literacy  activities,  and  thus,  improvement 
in  EF  would  be  less  likely  to  be  related  to  improvement  in  literacy 
during  this  time  frame.  Finally,  the  latent  math  and  literacy  slopes 
were  significantly  related,  providing  additional  evidence  that  early 
math and literacy skills codevelop over the preschool and kindergarten
years (Duncan et al., 2007; LeFevre et al., 2010; Purpura et
al.,  2011). 

Conclusions  From  the  Integration  of  Both 
Analytic  Approaches 

Results  from  the  two  analytic  approaches  provide  a  similar  story 
with  regard  to  our  overarching  research  question.  Both  the  panel 
model  and  the  LGCM  suggested  positive  correlations  between 
initial  levels  of  EF,  math,  and  literacy.  Thus,  and  consistent  with 
previous  research  (McClelland  et  al.,  2007;  Schmitt  et  al.,  2014), 
there  is  strong  evidence  that  these  three  constructs  are  tightly 
coupled  by  the  time  children  enter  preschool.  However,  both  sets 
of  results  also  suggest  that  EF  and  math  are  consistently  related 
over  time,  whereas  the  association  between  EF  and  literacy  is 
weak.  Taken  together,  the  LGCM  and  panel  model  therefore 
suggest  that  some  early  factor  (math  or  an  outside  variable)  likely 
helps explain the correlation between EF and literacy. The development
of EF and literacy seems to be driven by separate processes
during the transition to kindergarten, however.

Limitations  and  Future  Directions 

Although  this  study  extends  existing  literature  on  the  relations 
between  EF  and  early  academic  skills,  there  are  also  several 
limitations.  First,  we  utilized  several  measures  of  EF  in  our  study 
but  only  one  measure  each  for  math  (Applied  Problems)  and 
literacy (Letter-Word Identification). These subtests measure specific
components of math (e.g., counting, calculation) and literacy
(e.g.,  decoding,  word-reading)  and  may  therefore  not  represent 
comprehensive  growth  in  these  broader  academic  domains.  It  will 
be  important  for  future  studies  to  include  additional  measures  of 
early  academic  skills  to  further  our  understanding  of  how  complex 
skills  like  math  and  literacy  develop.  For  instance,  other  research 
has  shown  that  the  relations  between  EF  and  math  differ  based  on 
the  distinct  subcomponents  of  math  that  were  measured  (Lan, 
Legare,  Cameron  Ponitz,  &  Morrison,  2011;  Purpura  &  Ganley, 
2014).  A  comparison  of  more  targeted  relations  was  not  possible  in 
the  current  study  due  to  our  use  of  only  one  measure  each  for  math 
and  literacy.  Moreover,  utilizing  multiple  measures  of  math  in 
future  studies  will  help  elucidate  the  extent  to  which  EF  actually 
differentially predicts components of math at different ages. Indeed,
as the Applied Problems subtest becomes more challenging,
demands  on  EF  become  stronger.  Changes  in  the  relations  between 
EF  and  math  at  different  ages  may  not  necessarily  mean  EF  is  a 
better  or  worse  predictor  of  math,  but  that  changes  in  these 
relations  are  related  to  the  mathematics  concepts  targeted  within 
specific  assessment  measures. 

Second,  as  noted  above,  the  quantity  of  instruction  may  have 
varied  across  time  for  specific  domains  (particularly  for  math),  and 
these  differences  may  have  altered  the  relations  between  domains. 
For  example,  more  time  spent  engaging  in  math  instruction  may 
affect  the  development  of  math,  which,  in  turn,  could  change  the 
relations  between  math  and  EF  or  between  math  and  literacy.  In 
the current study, math and literacy instructional practices, activities
in schools and at home, or active learning in these domains
were not assessed. Moreover, other contextual factors as well as
individual child characteristics not measured in this study, such as
parenting practices, early language abilities, or motor development,
may be contributing to growth in EF and academic skills
(McClelland et al., 2015). Further research that includes contextual
factors and additional child characteristics may enhance our understanding
of the linked development across these domains.

Third,  recent  research  suggests  that  cross-lagged  panel  models 
can  produce  biased  estimates  due  to  unmodeled  trait-like  stability 
(e.g.,  Hamaker,  Kuiper,  &  Grasman,  2015).  Although  the  present 
analyses  used  a  likelihood  ratio  test  to  show  no  evidence  of 
additional  trait-like  stability  (i.e.,  by  constraining  the  correlations 
between  factors  separated  by  more  than  one  lag  to  be  zero),  it  will 
be critical for future research to explore alternative model specifications
when investigating EF and academic outcomes over time.
Future  studies  should  also  test  for  mediating  effects  (e.g.,  via  panel 
models),  as  our  findings  suggest  that  EF  may  partially  mediate  the 
relation  between  math  in  preschool  and  math  in  kindergarten. 

Fourth,  it  is  important  to  note  that  there  was  attrition  across  the 
four  waves  of  data,  particularly  as  children  were  transitioning  from 
preschool  to  kindergarten  (between  Times  2  and  3).  Although  we 
accounted  for  missing  data  by  using  robust  maximum  likelihood 
and included Head Start status in all of our models (which predicted
missingness between these waves), different patterns of
reciprocal  relations  in  preschool  and  kindergarten  may  be  due  to 
attrition. 

Finally, although our sample was diverse in terms of socioeconomic
status, it was less ethnically diverse. We relied on a convenience
sample for the present analyses, and future research is
needed to replicate our findings with more representative and
ethnically diverse samples to determine whether or not the findings
generalize to other populations.

Conclusions 

Findings from this study have potential implications for instruction
and intervention development that need to be investigated in
a more targeted manner. It may be important to consider the EF
demands of mathematical instruction at these ages. The relation
between EF and math may be something that can be capitalized on
through instruction. Integrating the domains at a very targeted
level (e.g., that includes appropriate individual scaffolding) may be
a useful mechanism for enhancing success across domains. Further,
intervention efforts focused on EF (or math) may also have a
beneficial effect on children’s math (or EF) development. Although
our analyses preclude causal conclusions, the bidirectional associations,
as well as correlated growth trajectories, between EF and
math suggest that interventions and programs that contain both
EF and academic training, particularly in math, may be a potential
avenue for effecting change during the transition to kindergarten.
Future research examining causal connections between these domains
at a more nuanced level is needed.

Findings from this study also suggest that, without intervention,
children’s relative standing on EF, math, and literacy assessments
is fairly stable over time. This finding has implications for future
theoretical work examining the development of these constructs.
More research is needed to identify predictors of these skills prior
to and during preschool at the biological, familial, and socioeconomic
levels.

In sum, the current study replicates and extends current literature
exploring EF, math, and literacy. Unlike previous work, we
used a multi-analytic approach and found converging evidence for
the longitudinal relations between EF and math and weaker relations
between EF and literacy. These findings expand upon what
was  found  in  the  study  conducted  by  Fuhs  and  colleagues  (2014). 
With  the  addition  of  a  fourth  time  point  at  the  beginning  of 
kindergarten,  we  were  able  to  contribute  to  current  research  by 
improving the specificity of the relations between EF and academic
skills by identifying at which points the relations change
during the transition to kindergarten at a more fine-grained level.
Changes in these relations may be due to factors within the
preschool and kindergarten classrooms, such as instructional methods
and alignment to children’s needs, or due to the constructs
being  assessed  at  those  ages.  This  change  in  relation  is  important 
for  the  development  of  instructional  strategies  and  interventions 
that  aim  to  improve  either  EF  or  math.  In  preschool,  it  may  be 
more  efficacious  to  target  both  EF  and  math  simultaneously, 
whereas  in  kindergarten,  targeting  EF  as  a  foundational  skill  set 
may  be  more  important.  Alternatively,  there  may  be  differential 
relations  between  aspects  of  EF  and  mathematics  where  EF  is  only 
related  to  certain  mathematics  skills  (Lan  et  al.,  2011;  Purpura, 
Schmitt,  &  Ganley,  2017;  Purpura  &  Ganley,  2014)  that  affect  this 
relation.  These  differential  relations  may  need  to  be  accounted  for 
in  intervention  and  curricular  development.  Nonetheless,  findings 
from  both  sets  of  analyses  suggest  that  fostering  the  development 
of  EF  and  early  math  skills  during  the  transition  to  kindergarten 
may be a potentially important avenue for promoting school readiness
and fostering academic success that needs to be investigated
more  thoroughly. 
Appendix A

SAS  Code  for  Dividing  Items  by  Constants 


DATA use; SET use;

/* DIVIDE HTKS SUMS BY 15 */
htks1 = sum(of htkss1_1 htkss2_1 htkss3_1)/15;
htks2 = sum(of htkss1_2 htkss2_2 htkss3_2)/15;
htks3 = sum(of htkss1_3 htkss2_3 htkss3_3)/15;
htks4 = sum(of htkss1_4 htkss2_4 htkss3_4)/15;

/* DIVIDE DCCS SUMS BY 3 */
dccs1 = sum(of dccss1_1 dccss2_1 dccss3_1 dccss4_1)/3;
dccs2 = sum(of dccss1_2 dccss2_2 dccss3_2 dccss4_2)/3;
dccs3 = sum(of dccss1_3 dccss2_3 dccss3_3 dccss4_3)/3;
dccs4 = sum(of dccss1_4 dccss2_4 dccss3_4 dccss4_4)/3;

/* DIVIDE ALL WJ SCORES BY 10 */
wjapw_1 = wjapw_1/10;
wjapw_2 = wjapw_2/10;
wjapw_3 = wjapw_3/10;
wjapw_4 = wjapw_4/10;
wjlww_1 = wjlww_1/10;
wjlww_2 = wjlww_2/10;
wjlww_3 = wjlww_3/10;
wjlww_4 = wjlww_4/10;
wjpvw_1 = wjpvw_1/10;
wjpvw_2 = wjpvw_2/10;
wjpvw_3 = wjpvw_3/10;
wjpvw_4 = wjpvw_4/10;
wjwmw_1 = wjwmw_1/10;
wjwmw_2 = wjwmw_2/10;
wjwmw_3 = wjwmw_3/10;
wjwmw_4 = wjwmw_4/10;

/* RECENTER AGE AT 4.5 YRS */
ageyrs_1 = ageyrs_1 - 4.5;
ageyrs_2 = ageyrs_2 - 4.5;
ageyrs_3 = ageyrs_3 - 4.5;
ageyrs_4 = ageyrs_4 - 4.5;

/* DROP INDIVIDUAL ITEMS, ANALYSIS AT COMPOSITE LEVEL ONLY */
drop htkss1_1 htkss1_2 htkss1_3 htkss1_4
     htkss2_1 htkss2_2 htkss2_3 htkss2_4
     htkss3_1 htkss3_2 htkss3_3 htkss3_4
     dccss1_1 dccss2_1 dccss3_1 dccss4_1
     dccss1_2 dccss2_2 dccss3_2 dccss4_2
     dccss1_3 dccss2_3 dccss3_3 dccss4_3
     dccss1_4 dccss2_4 dccss3_4 dccss4_4
     hstart_2 cspan_2 cspan_3 cspan_1;

RUN;
Appendix B

Additional  Tables  and  Figures 


Table B1

Residual Correlations from Final Structural Equation Model

Construct       1         2         3         4         5         6         7         8         9         10        11        12
1. EF1          1.00
2. Math1        0.66***   4.01
3. Literacy1    0.45***   0.40***   5.46
4. EF2          –         –         –         0.28
5. Math2        –         –         –         0.25*     1.43
6. Literacy2    –         –         –         0.11      0.10      2.20
7. EF3          –         –         –         –         –         –         0.29
8. Math3        –         –         –         –         –         –         0         1.11
9. Literacy3    –         –         –         –         –         –         0.05      0.16**    2.43
10. EF4         –         –         –         –         –         –         –         –         –         0.40
11. Math4       –         –         –         –         –         –         –         –         –         0.13      1.01
12. Literacy4   –         –         –         –         –         –         –         –         –         0.06      0.06      3.76

Note. A dash indicates values were not estimated. Variances and residual variances on diagonal, correlations below diagonal. Squared factor loadings
(therefore representing item variances) provided for math and literacy. Indicators were divided by constants to make their variances more homogeneous, thus
expediting model convergence (e.g., Muthen, 2010). Dashes indicate that the data were not obtained/reported. EF = executive functioning.

* p < .05. ** p < .01. *** p < .001.


Table B2

Comparison of BIC for Final Panel Model and Growth Curve (N = 424)

Model           BIC
Panel model     30,624.303
Growth curve    30,435.914

Note. BIC = Bayesian information criterion.
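
As a worked check on the values in Table B2, the difference between the two reported BIC values can be computed directly. The symbol \Delta\mathrm{BIC} and the subscripts are introduced here only for illustration and do not appear in the original analyses. Under the common convention that lower BIC values indicate better relative fit, a positive difference of this kind would favor the growth curve model.

```latex
\Delta\mathrm{BIC}
  = \mathrm{BIC}_{\text{panel}} - \mathrm{BIC}_{\text{growth}}
  = 30{,}624.303 - 30{,}435.914
  = 188.389
```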
Figure B1. Path diagram representing the executive functioning (EF) component of the initial confirmatory
factor  analysis  (CFA).  Mean  structure  and  indicator  residuals  are  omitted  from  the  diagram,  but  all  indicator 
residuals  and  intercepts  were  freely  estimated.  All  latent  means  were  fixed  to  zero.  All  indicators  were  controlled 
for  covariates  (not  shown). 


Figure  B2.  Path  diagram  representing  the  executive  functioning  (EF)  component  of  the  initial  latent  growth 
curve  modeling  (LGCM).  Mean  structure  and  indicator  residuals  are  omitted  from  the  diagram,  but  all  indicator 
residuals  were  freely  estimated,  with  factor  loadings  and  indicator  intercepts  estimated  but  equated  across  time. 
Head-Toes-Knees-Shoulders  task  (HTKS)  served  as  a  marker  variable,  with  its  loading  fixed  to  1.00  and 
intercept  fixed  to  0.00.  All  latent  means  for  EF  were  fixed  to  zero  and  means  for  all  growth  parameters  (intercept 
and  two  slopes)  were  freely  estimated.  All  growth  parameters  were  controlled  for  covariates  (not  shown). 


[Figure B3 path diagram: EF, Math, and Literacy at Times 1 through 4, with standardized path coefficients and R2 values. * p < .05. ** p < .01. *** p < .001.]

Figure B3. Path diagram for the final structural model (standardized coefficients). Variances, residual variances,
and within-wave covariances provided in Table B1. Time 1 = fall of preschool; Time 2 = spring of
preschool; Time 3 = fall of kindergarten; Time 4 = spring of kindergarten.
Achievement Goals, Reasons for Goal Pursuit, and Achievement Goal
Complexes as Predictors of Beneficial Outcomes: Is the Influence of Goals
Reducible to Reasons?

Nicolas Sommet and Andrew J. Elliot
In the present research, we proposed a systematic approach to disentangling the shared and unique variance
explained by achievement goals, reasons for goal pursuit, and specific goal-reason combinations (i.e.,
achievement goal complexes). Four studies using this approach (involving nearly 1,800 participants) led to 3
basic sets of findings. First, when testing goals and reasons separately, mastery (-approach) goals and
autonomous  reasons  explained  variance  in  beneficial  experiential  (interest,  satisfaction,  positive  emotion)  and 
self-regulated  learning  (deep  learning,  help-seeking,  challenging  tasks,  persistence)  outcomes.  Second,  when 
testing  goals  and  reasons  simultaneously,  mastery  goals  and  autonomous  reasons  explained  independent 
variance  in  most  of  the  outcomes,  with  the  predictive  strength  of  each  being  diminished.  Third,  when  testing 
goals,  reasons,  and  goal  complexes  together,  the  autonomous  mastery  goal  complex  explained  incremental 
variance  in  most  of  the  outcomes,  with  the  predictive  strength  of  both  mastery  goals  and  autonomous  reasons 
being  diminished.  Comparable  results  were  observed  for  performance  (-approach)  goals,  the  autonomous 
performance  goal  complex,  and  performance  goal-relevant  outcomes.  These  findings  suggest  that  achievement 
goals  and  reasons  are  both  distinct  and  overlapping  constructs,  and  that  neither  unilaterally  eliminates  the 
influence  of  the  other.  Integrating  achievement  goals  and  reasons  offers  the  most  promising  avenue  for  a  full 
account  of  competence  motivation. 


Educational  Impact  and  Implications  Statement 

The  present  research  seeks  to  disentangle  the  influence  of  “what”  individuals  want  to  achieve  (type  of 
goals),  “why”  they  want  to  achieve  (type  of  reasons),  and  specific  “what”  and  “why”  combinations  (type 
of  goal-reason  combinations).  In  four  studies,  we  showed  that  mastery  goals  (striving  for  task  mastery), 
autonomous reasons (striving because it is stimulating and valued), and a specific mastery
goal–autonomous reason combination (striving for task mastery because it is stimulating and valued) all made
separate positive contributions to beneficial achievement-relevant outcomes (e.g., interest, positive emotion,
deep learning). Comparable results were observed for performance goals (striving to outperform
others) and a specific performance goal–autonomous reason combination (striving to outperform others
because  it  is  stimulating  and  valuable).  The  present  findings  indicate  that  both  type  of  goals  and  type  of 
reasons  are  important  for  a  full  understanding  of  achievement  motivation. 


Keywords: achievement goal, autonomous and controlled reasons, self-determination theory, achievement
goal complex


The achievement goal approach provides a framework for understanding
the direction of behavior, addressing the question of what
individuals want to achieve (Dweck, 1986; Maehr & Nicholls, 1980;
Nicholls, 1984). However, a complete conceptual framework of
achievement motivation must also account for the energization of
behavior,  addressing  the  question  of  why  individuals  want  to  achieve 
(Elliot  &  Thrash,  2001). 

The “whys” (i.e., reasons) behind achievement goals can be conceptualized
in many ways (e.g., social values, achievement motives;
Dompnier, Darnon, & Butera, 2009; McClelland, 1985). However, in
recent years researchers have focused mostly on reasons derived from
self-determination theory (SDT; Ryan & Deci, 2000). In several
studies, researchers have reported that the influence of achievement
goals on beneficial outcomes is no longer statistically significant
when partialing out the variance explained by the SDT-derived reasons
connected with the achievement goals (for a review, see Vansteenkiste,
Lens, Elliot, Soenens, & Mouratidis, 2014). These findings
are sometimes interpreted as indicating that the influence of achievement
goals is reducible to the reasons behind them, thereby questioning
the importance of achievement goals in the study of motivation.

In the present research, we take a step back to carefully examine
this  empirical  work  and  to  reconsider  the  conclusions  that  can  be 
drawn  from  it.  We  propose  a  systematic  approach  for  studying 
achievement  goals,  reasons,  and  specific  achievement  goal-reason 
combinations  (i.e.,  achievement  goal  complexes;  Elliot  &  Thrash, 
2001).  We  use  this  approach  in  four  studies  to  disentangle  the  shared 
and  unique  variance  explained  by  these  motivational  constructs  in 
predicting  the  most  commonly  investigated  beneficial  outcomes  in  the 
achievement domain. We believe that this approach holds considerable
promise, in that it demonstrates how achievement goals fit in a
broader  theory  of  achievement  motivation. 

Mastery  Goals  as  a  Predictor  of  Beneficial  Outcomes 

Achievement goals are social-cognitive mental foci that direct
individuals’ responses in competence-relevant situations (Elliot,
1999). Achievement goal researchers focus primarily on two
types of competence-based goals, crossed by the approach-avoidance
distinction (for a historical review, see Elliot, 2005).
Mastery-focused individuals use a task- or self-referenced standard
in competence evaluation, whereas performance-focused individuals
use an other-referenced standard. Both mastery and performance
goals involve striving to approach competence or avoid
incompetence, resulting in a 2 × 2 model of achievement goals:
mastery-approach, mastery-avoidance, performance-approach, and
performance-avoidance.

In  the  literature,  mastery-approach  goals  are  primarily  linked  to  a 
pattern  of  adaptive  outcomes,  performance-approach  goals  to  a  mixed 
pattern  of  adaptive  and  maladaptive  outcomes,  and  the  two  avoidance 
goals  to  varied  patterns  of  maladaptive  outcomes  (for  meta-analyses, 
see Baranik, Stanley, Bynum, & Lance, 2010; Huang, 2011, 2016;
Hulleman,  Schrager,  Bodmann,  &  Harackiewicz,  2010;  Van  Yperen, 
Blaga,  &  Postmes,  2014,  2015).  In  the  present  research,  we  are 
interested  in  separating  the  influence  of  achievement  goals  from  the 
influence  of  reasons  when  predicting  beneficial  achievement-relevant 
outcomes.  It  is  therefore  critical  to  select  goals  and  reasons  that  are 
clearly  adaptive  (and  whose  beneficial  influences  are  comparable  in 
nature and scope). Accordingly, our primary focus is on mastery-approach
goals (i.e., mastering a task, improving over time; hereafter
referred  to  as  mastery  goals),  although  in  our  final  study  we  extend  the 
focus  to  performance-approach  goals  (i.e.,  outperforming  others; 
hereafter  referred  to  as  performance  goals). 

Two  types  of  adaptive  achievement-relevant  outcomes  are  reliably 
associated  with  mastery  goals.  First,  mastery  goals  are  positively 
related  to  beneficial  experiential  outcomes,  that  is,  positive  affective 
and phenomenological responses to achievement tasks (Harackiewicz,
Barron, Carter, Lehto, & Elliot, 1997; Pekrun, 2006). Mastery
goals  are  thought  to  direct  attention  to  the  achievement  activity  itself 
and  increase  appraisals  of  task  controllability  and  self-efficacy, 
thereby  facilitating  the  positive  subjective  value  of  the  task  (Dweck, 
1999;  Kaplan  &  Maehr,  2007;  Pekrun,  Elliot,  &  Maier,  2006).  For 
instance, in the workplace, mastery goals have been shown to positively
predict job interest (Retelsdorf, Butler, Streblow, & Schiefele,
2010), job satisfaction (Janssen & Van Yperen, 2004), and job positive
emotion (Fisher, Minbashian, Beckmann, & Wood, 2013). Second,
mastery goals are positively related to beneficial self-regulated
learning outcomes, that is, metacognitive, strategic, proactive responses
to achievement tasks (Pintrich, 1999; Zimmerman, 1989).
Mastery goals require the attainment of task-focused and intrapersonal
standards, which promote a fully engaged approach to learning and
full effort expenditure (Meece, Anderman, & Anderman, 2006; Nicholls,
1989; Senko, Hama, & Belmonte, 2013). As such, mastery goals
have  been  shown  to  positively  predict  deep-processing  (Diseth,  2011), 
interpersonal  help-seeking  behavior  (Karabenick,  2004),  a  preference 
for  challenging  tasks  (Ames  &  Archer,  1988),  and  task  persistence 
(Sideridis  &  Kaplan,  2011). 

Autonomous  Reasons  as  a  Predictor  of 
Beneficial  Outcomes 

SDT  is  a  theory  of  motivation  that  highlights  the  importance  of 
underlying  reasons  for  behavior,  including  goal-directed  behavior 
(Deci & Ryan, 2000; Sheldon, 2004). The theory distinguishes between
two primary types of reasons for goal pursuit. Autonomous
reasons include pursuing goals because they are fun or enjoyable
(intrinsic regulation), or because one identifies with them as important
or meaningful (identified regulation); controlled reasons include pursuing
goals because they enable one to bolster the ego or avoid feeling
shame  (introjected  regulation),  or  because  they  allow  one  to  obtain  a 
reward  (external  regulation;  Deci  &  Ryan,  2000).  In  the  literature, 
autonomous reasons most commonly predict beneficial
outcomes, whereas controlled reasons most commonly predict
detrimental outcomes (Ratelle, Guay, Vallerand, Larose, & Senécal,
2007). Accordingly, our primary focus is on autonomous reasons
(although in all of our studies we assessed and controlled for controlled
reasons, as well).

Autonomous  reasons  for  goal  pursuit  are  associated  with  the  same 
beneficial  outcomes  as  those  reviewed  above  for  mastery  goals  (for  a 
review,  see  Ryan  &  Deci,  2006).  First,  autonomous  reasons  are 
positively  related  to  beneficial  experiential  outcomes,  because  they 
involve  acting  in  a  more  volitional  way,  thereby  making  the  activity 
more enjoyable and immersive (Vansteenkiste, Lens, et al., 2014). For
instance, in the workplace, autonomous reasons have been shown to
positively predict job interest (Gagné & Deci, 2005), job satisfaction
(Lam & Gurland, 2008), and job positive emotion (Gagné et al.,
2010). Second, autonomous reasons are positively related to beneficial
self-regulated learning outcomes, because goal pursuit is viewed
as  a  positive  challenge,  providing  a  meaningful  impetus  for  effort 
expenditure  and  personal  growth  (Deci,  Vallerand,  Pelletier,  &  Ryan, 
1991).  Specifically,  empirical  work  has  shown  that  these  reasons 
positively  predict  deep  learning  strategy  (Vansteenkiste,  Zhou,  Lens, 
&  Soenens,  2005),  interpersonal  help-seeking  behavior  (Skaalvik  & 
Skaalvik, 2013), a preference for challenge (Standage, Duda, & Ntoumanis,
2005), and persistence (Vallerand, Fortier, & Guay, 1997).

Combining  Mastery  Goals  and  Autonomous  Reasons 
as  Predictors  of  Beneficial  Outcomes 

Any given achievement goal may be adopted for a variety of
reasons. These reasons may vary from competence-relevant (e.g., to
succeed at university; Dompnier et al., 2009) to not competence-relevant
(e.g., to gain respect from others; Urdan & Mestas, 2006),
and from intrapersonally evoked (e.g., a desire to experience pride;
Urdan, 2004a) to environmentally evoked (e.g., a teacher demand;
Wolters, 2004). Recently, researchers have shown an interest in conceptualizing
these reasons using SDT (see Vansteenkiste & Mouratidis,
2016). Vansteenkiste, Mouratidis, and Lens (2010) were the first
to publish empirical work relying on such a conceptualization. Soccer
players first reported their performance goals (e.g., “It is my goal to
perform  better  than  my  direct  opponent”);  then,  they  reported  the 
autonomous  and  controlled  reasons  connected  to  their  performance 
goals  (e.g.,  “[It  is  my  goal  to  perform  better  than  my  direct  opponent] 
because  this  goal  is  a  challenge  to  me,”  pp.  223-230).  The  relations 
between performance goals and beneficial experiential outcomes were
found either to drop to nonsignificance (e.g., for positive emotion) or to
decrease considerably (e.g., for subjective vitality) when controlling for the
positive influence of the autonomous reasons connected to performance
goals (for comparable results in educational settings, see Gillet,
Lafrenière, Vallerand, Huart, & Fouquereau, 2014; Vansteenkiste,
Smeets, et al., 2010).

Gillet, Lafrenière, Huyghebaert, and Fouquereau (2015) used
this  same  approach  to  study  the  SDT-derived  reasons  connected  to 
mastery  goals.  Workers  first  reported  their  mastery  goals,  and  then 
they  reported  the  autonomous  and  controlled  reasons  connected  to 
their  mastery  goals  (e.g.,  “[My  goal  is  to  improve]  because  of  the 
fun  and  enjoyment  that  it  provides  me,”  p.  862).  The  relations 
between  mastery  goals  and  beneficial  experiential  (e.g.,  positive 
emotion)  and  self-regulated  learning  (e.g.,  engagement)  outcomes 
dropped to nonsignificance when controlling for the positive influence
of the autonomous reasons connected to mastery goals (see
also  Gaudreau  &  Braaten,  2016;  for  related  research  with  dominant 
achievement  goals,  see  Michou,  Vansteenkiste,  Mouratidis,  & 
Lens,  2014;  Ozdemir  Oz,  Lane,  &  Michou,  2015;  Vansteenkiste, 
Mouratidis,  van  Riet,  &  Lens,  2014). 

In  interpreting  these  results,  researchers  commonly  state  that 
their  methodology  has  enabled  them  to  detach  reasons  from  goals, 
and  that  the  autonomous  reasons  connected  to  the  achievement 
goals  are  stronger  (Gillet  et  al.,  2015),  more  robust  (Vansteenkiste, 
Mouratidis,  et  al.,  2010),  and  more  important  (Deci  &  Ryan,  2016) 
predictors  of  beneficial  outcomes  than  the  achievement  goals  per 
se. We do not agree with these interpretations (see also Vansteenkiste,
Mouratidis, et al., 2014, for a more nuanced view). We
believe  that  the  reason-based  variable  focused  on  in  the  extant 
work  is  best  represented  as  an  achievement  goal  complex.  An 
achievement  goal  complex  is  a  composite  motivational  construct, 
composed of an achievement goal combined with information
regarding  the  reason  for  pursuing  the  goal  (Elliot  &  Thrash,  2001). 
The  structural  form  of  an  achievement  goal  complex  is 
“ACHIEVEMENT GOAL because REASON,” which is the typical
form of the reason-based variables used in the aforementioned
research, for example “MY GOAL IS TO IMPROVE because OF
THE FUN AND ENJOYMENT THAT IT PROVIDES ME.”

The  consequence  of  such  a  reinterpretation  is  twofold.  First,  in 
the  approach  used  to  date,  autonomous  and  controlled  reasons  have 
only  been  operationalized  with  reference  to  the  specific,  focal 
achievement  goal;  there  has  been  no  assessment  of  reasons  in  and 
of  themselves,  separate  from  the  focal  achievement  goal.  Thus, 
from  our  perspective,  the  results  of  the  existing  research  actually 
indicate  that  autonomous  achievement  goal  complexes  eliminate 
or  reduce  the  influence  of  achievement  goals  per  se,  not  that 
autonomous  reasons  in  and  of  themselves  eliminate  or  reduce  the 
influence  of  achievement  goals  per  se.  Second,  it  is  important  to 
bear  in  mind  that  in  the  approach  used  to  date  there  is  redundancy 
in  the  measurement  of  achievement  goals:  The  achievement  goal  is 
assessed  multiple  times,  both  alone  as  a  focal  goal  and  in  the 
reason-based  variable  that  connects  the  goal  with  reasons  (see 
Senko  &  Tropiano,  2016,  for  a  related  point).  Thus,  it  should  not 


be  surprising  that  autonomous  achievement  goal  complexes  elim¬ 
inate  or  reduce  the  influence  of  achievement  goals  per  se,  because 
the  two  variables  have  overlapping  content.  In  the  following,  we 
seek  to  clarify  and  extend  the  existing  research  by  proposing  a 
systematic  approach  to  studying  achievement  goals,  reasons  for 
goal  pursuit,  and  specific  achievement  goal  complexes. 

A  Systematic  Approach  to  Studying  Goals,  Reasons, 
and  Goal  Complexes 

Goal  complexes  are  multicomponent  constructs.  In  studying 
them,  it  is  important  to  carefully  distinguish  between  their  com¬ 
ponent  parts  and  to  design  assessments  accordingly.  A  first  com¬ 
ponent  is  the  focal  goal  that  represents  an  aim  per  se  without  any 
accompanying  reason.  In  measurement,  it  is  critical  to  use  a  “pure 
goal”  assessment  uncontaminated  by  reason  content  (e.g.,  for 
mastery  goals:  “My  goal  is  to  learn;”  see  Elliot  &  Murayama, 
2008,  on  this  contamination  issue).  A  second  component  is  the  focal 
reason  that  represents  a  more  general  form  of  motivation  without  any 
specific  aim.  In  measurement,  it  is  critical  to  also  use  a  “pure  reason” 
assessment  uncontaminated  by  specific  goal  content  (e.g.,  for  auton¬ 
omous  reasons:  “I  pursue  goals  because  I  find  them  challenging”).1 
Combining  the  pure  goal  with  the  pure  reason  creates  a  third  con¬ 
struct,  the  integrated  goal  complex.  It  represents  an  instrumental 
relation  between  the  goal  and  the  reason:  The  goal  serves  the  reason 
and  the  reason  provides  the  impetus  for  goal  adoption  and  pursuit.  In 
measurement,  this  functional  relation  is  explicitly  expressed  (e.g.,  for 
the  autonomous  mastery  goal  complex:  “My  goal  is  to  learn  because 
I  find  this  a  highly  challenging  goal”).2 

Once  these  three  constructs — goal,  reason,  and  goal  complex — 
are  separately  assessed,  they  may  be  used  in  three  sets  of  analyses. 
First,  goals  and  reasons  may  be  tested  separately  to  determine  their 


1  In  the  literature,  SDT-derived  reason  assessments  are  often  tied  to  a 
generic  goal-directed  behavior  (e.g.,  “I  work  because  it  is  fun;”  Gagne  & 
Deci,  2005,  p.  334).  However,  goal  complex  assessments  are  not  tied  to  a 
behavior,  but  to  a  particular  goal  (e.g.,  “In  my  work,  my  goal  is  to  learn 
because  I  find  it  fun”;  see  Vansteenkiste,  Lens,  et  al.,  2014).  When 
studying  goal  complexes,  as  distinct  from  other  motivational  complexes 
(see  Murray,  1938),  it  is  critical  to  operationalize  reasons,  goals,  and  goal 
complexes  in  a  symmetrical  manner:  Each  motivational  construct  should 
be  measured  with  respect  to  the  same  reference  component.  Specifically,  in 
order  to  isolate  the  influence  of  reasons  from  the  influence  of  goals  and 
goal  complexes,  SDT-derived  reason  assessments  need  to  be  stripped  of 
behavioral  elements  and  tied  to  goal  regulation  in  general  (e.g.,  “In  my 
work,  I  pursue  goals  because  I  find  them  fun;”  for  such  an  operationaliza¬ 
tion,  see  Sheldon  &  Elliot,  1998). 

2  In  past  research,  an  achievement  goal  complex  was  sometimes  opera¬ 
tionalized  as  the  product  term  between  an  achievement  goal  and  a  reason 
variable  (e.g.,  Gaudreau,  2012;  for  experimental  work,  see  Benita,  Roth,  & 
Deci,  2014;  Spray,  John  Wang,  Biddle,  &  Chatzisarantis,  2006).  In  our 
approach,  however,  the  product  term  between  the  “pure  mastery  goal” 
variable  and  the  “pure  autonomous  reason”  variable  would  not  correspond 
to  an  autonomous  mastery  goal  complex.  “Pure  mastery  goals”  may  be 
energized  by  reasons  other  than  autonomous  reasons  (e.g.,  controlled 
reasons),  whereas  “pure  autonomous  reasons”  may  be  directed  by  goals 
other than mastery goals (e.g., performance goals); therefore, the interaction
between  mastery  goals  and  autonomous  reasons  does  not  necessarily 
represent  an  autonomous  mastery  goal  complex.  In  other  words,  high 
mastery  goals  and  high  autonomous  reasons  do  not  always  indicate  a  high 
autonomous  mastery  goal  complex,  and  a  third  composite  variable  is 
needed  to  capture  the  extent  to  which  these  goals  and  reasons  combine  to 
form  a  single,  inseparable,  and  additional  achievement  goal  complex 
variable. 
individual links to outcomes. Second, goals and reasons may be
tested  simultaneously  to  determine  their  unique  links  to  outcomes. 
Third,  goal  complexes  may  be  tested  together  with  goals  and 
reasons  to  determine  the  incremental  contribution  of  goal  com¬ 
plexes  to  outcomes,  as  well  as  the  contribution  of  goals  per  se  and 
reasons  per  se.  In  the  following,  we  apply  this  approach  to  the 
central  constructs  studied  in  our  research  herein:  mastery  goals, 
autonomous  reasons,  and  autonomous  mastery  goal  complexes. 

Testing  Mastery  Goals  and  Autonomous  Reasons  as 
Separate  Predictors 

As reviewed earlier, mastery goals and autonomous reasons have
been shown to similarly predict beneficial achievement-relevant outcomes.
We expected to find the same predictive patterns for mastery
goals and autonomous reasons as those found in prior work.

Hypothesis 1: Mastery goals (H1a) and autonomous reasons
(H1b) are positive predictors of beneficial experiential and
self-regulated  learning  outcomes. 

Testing  Mastery  Goals  and  Autonomous  Reasons  as 
Simultaneous  Predictors 

Mastery  goals  and  autonomous  reasons  are  both  distinct  and  over¬ 
lapping  constructs.  They  are  conceptually  distinct  in  that  they  have 
unique  properties,  operate  at  different  levels  of  specificity,  and  have 
different  functions.  Mastery  goals  are  concrete  cognitive  representa¬ 
tions  of  future  competence-relevant  possibilities  that  proximally  direct 
individuals’  behavior  (Elliot  &  Fryer,  2008).  Autonomous  reasons  are 
general  need-based  internal  forces  that  provide  energy  for  action 
(Deci  &  Ryan,  2008).  Furthermore,  principal  component  factor  anal¬ 
ysis  has  revealed  that  mastery  goal  and  autonomous  reason  items 
loaded  on  different  factors  (Dysvik  &  Kuvaas,  2010).  Given  their 
conceptual  and  empirical  distinctiveness,  we  expected  mastery  goals 
and  autonomous  reasons  to  explain  independent  variance  in  the  ben¬ 
eficial  experiential  and  self-regulated  learning  outcomes  to  which  they 
are  (separately)  linked. 

Hypothesis  2:  Mastery  goals  (H2a)  and  autonomous  reasons 
(H2b)  explain  independent  variance  in  beneficial  experiential 
and  self-regulated  learning  outcomes. 

Although  they  are  conceptually  and  empirically  distinct,  mas¬ 
tery  goals  and  autonomous  reasons  are  also  overlapping  constructs. 
Mastery  goals  are  sometimes  described  as  intrinsic  goals  (Pintrich 
&  Garcia,  1991)  and  emerge  from  autonomy-supportive  contexts 
(Diseth  &  Samdal,  2014);  autonomous  reasons  are  viewed  as 
facilitating  the  expression  of  one’s  agentic  tendency  to  learn  (Ryan 
&  Powelson,  1991)  and  emerge  from  mastery-focused  climates 
(Standage  et  al.,  2005).  Furthermore,  a  positive  correlation  is 
commonly  observed  between  mastery  goals  and  autonomous  rea¬ 
sons  (e.g.,  Katz,  Assor,  &  Kanat-Maymon,  2008).  Given  this 
conceptual  and  empirical  overlap,  the  predictive  utility  of  mastery 
goals  should  be  diminished  when  partialing  out  the  variance  ex¬ 
plained  by  autonomous  reasons — this  is  consistent  with  the  posi¬ 
tion  articulated  in  the  extant  research  on  SDT-derived  reasons  and 
achievement  goals,  but  has  not  yet  been  tested.  Conversely,  the 
predictive  utility  of  autonomous  reasons  should  also  be  diminished 


when  partialing  out  the  variance  explained  by  mastery  goals — this 
also  has  not  been  tested  in  the  extant  research. 

Hypothesis 3: The predictive strength of mastery goals is
diminished when controlling for autonomous reasons (H3a),
and the predictive strength of autonomous reasons is diminished
when controlling for mastery goals (H3b).

Testing  Autonomous  Mastery  Goal  Complexes 
Together  With  Goals  and  Reasons 

According  to  gestalt  principles,  a  goal  complex  should  be  more 
than  the  mere  sum  of  a  goal  and  a  reason  (Lewin,  1951).  That  is, 
autonomous  reasons  combined  with  a  mastery  goal  should  do  more 
than just add an exogenous reason element to the goal; they should
alter  the  functional  significance  of  the  goal  and  the  experience  of 
goal  regulation  (Deci  &  Ryan,  1985;  Elliot,  2006).  Both  mastery 
goals  and  autonomous  reasons  are  commonly  portrayed  as  optimal 
forms  of  motivation  (Kaplan  &  Maehr,  2007;  Sheldon,  2004),  and 
it  is  likely  that  their  integration  in  the  form  of  an  achievement  goal 
complex  would  be  particularly  beneficial  for  achievement-relevant 
outcomes.  Autonomous  reasons  may  enhance  mastery  goal  persis¬ 
tence  and  attainment  via  challenge  appraisals  (Ntoumanis  et  al., 
2014),  and  mastery  goals  may  help  maintain  a  focus  on  the  positive 
value  of  the  task  and  facilitate  interest-based  engagement  (Huang, 
2011;  Senko  &  Miles,  2008).  In  other  words,  autonomous  reasons 
are  assumed  to  predict  goal  success  (i.e.,  effective  goal  regulation), 
and  when  specifically  combined  with  mastery  goals,  goal  success 
is  assumed  to  further  lead  to  beneficial  experiential  and  self- 
regulated  learning  outcomes  (i.e.,  effective  behavior  regulation). 
This  would  be  consistent  with  the  findings  observed  in  the  extant 
research  on  SDT-derived  reasons  and  achievement  goals,  although 
in  that  work  autonomous  reasons  in  and  of  themselves  were  not 
accounted  for. 

Hypothesis 4: The autonomous mastery goal complex explains
incremental variance in beneficial experiential and self-regulated
learning outcomes.

As  noted  above,  there  is  measurement  redundancy  when 
achievement  goal  complexes  and  their  component  parts  are  as¬ 
sessed.  As  such,  the  predictive  utility  of  mastery  goals  should  be 
diminished  when  examining  the  autonomous  mastery  goal  com¬ 
plex— this  is  how  we  interpret  the  findings  in  the  extant  research 
on  SDT-derived  reasons  and  achievement  goals.  Likewise,  given 
the  measurement  redundancy  with  regard  to  autonomous  reasons, 
the  predictive  utility  of  autonomous  reasons  should  be  diminished 
when  examining  the  autonomous  mastery  goal  complex — this  has 
not  been  considered  in  the  extant  research. 

Hypothesis 5: The predictive strength of mastery goals (H5a)
and autonomous reasons (H5b) is diminished when controlling
for the autonomous mastery goal complex.

Overview  of  the  Studies 

We  designed  four  studies  to  disentangle  the  influence  of 
achievement  goals  (especially  mastery  goals),  reasons  (especially 
autonomous  reasons),  and  achievement  goal  complexes  (especially 
the  autonomous  mastery  goal  complex)  on  the  most  commonly 
investigated beneficial experiential and self-regulated learning outcomes.
In Study 1, we tested Hypotheses 1a-1b, 2a-2b, and 3a-3b
(detaching goals from reasons); in Studies 2 to 4, we additionally
tested Hypotheses 4 and 5a-5b (detaching goal complexes from
goals and reasons). In Studies 1 and 2, we assessed beneficial
experiential outcomes (i.e., interest, satisfaction, positive emotion);
in Studies 3 and 4, we assessed beneficial self-regulated
learning outcomes (i.e., deep learning, help-seeking, challenging
tasks, persistence). In Studies 1 to 3, we focused solely on the goal
variable of central interest, namely mastery goals; in Study 4, we
extended the hypotheses to performance goals and performance
goal-relevant outcomes. Studies 1 to 3 were conducted in a work
setting; Study 4 was conducted in an educational setting. In each
study we also assessed controlled reasons (and associated controlled
achievement goal complexes). Given that our research
focused on beneficial outcomes and that controlled reasons and
controlled goal complexes are more likely to be predictors of
detrimental outcomes, no predictions were made for these variables.
However, as in prior research, these variables were entered
as covariates (e.g., Gillet et al., 2015) and the influence of controlled
achievement goal complexes will be addressed in the General
Discussion section.

Table  1  provides  a  summary  and  guide  for  the  research;  it  states 
each  hypothesis,  its  rationale,  its  operationalized  predictor(s),  and 
the  studies  and  outcomes  to  which  it  relates.  In  all  studies,  sample 
sizes  were  determined  a  priori,  and  all  manipulations,  data  exclu¬ 
sions,  and  measures  analyzed  are  reported.  Questionnaires,  raw 
data,  and  syntax  files  for  the  four  studies  are  available  through 
FigShare (https://figshare.com/s/18543835e916a359b33e).

Study  1.  Mastery  Goals,  Reasons,  and 
Experiential  Outcomes 

Study  1  was  designed  to  test  mastery  goals  and  SDT-derived 
reasons  as  predictors  of  three  experiential  outcomes.  Participants 
reported  their  work-based  mastery  goals,  and  their  autonomous 
and  controlled  reasons  for  goal  pursuit.  Participants  also  reported 
their  job  interest,  satisfaction,  and  positive  emotion;  we  assessed 
these  variables  with  measures  used  in  prior  work  in  this  area 
(Gillet  et  al.,  2015,  2014;  Ozdemir  Oz  et  al.,  2015). 

Method 

Participants. Amazon Mechanical Turk (MTurk) was used as
the crowdsourcing platform for data collection. MTurk workers are
more demographically diverse than standard Internet samples and
American undergraduate samples (Buhrmester, Kwang, & Gosling,
2011). An a priori power analysis revealed that 395 participants
were needed to detect small-sized effects (f² = .02) in a
multiple linear regression model with power of .80. We oversampled
to make sure that we exceeded our target sample size after
excluding missing data. To participate, MTurk workers had to
currently have a job. A total of 467 participants completed the
questionnaire; seven were excluded a priori due to missing data on
the outcome variables. The final sample consisted of 460 U.S.
residents, 278 men and 181 women (one not reported), with a
mean age of 32.18 (SD = 9.04), and having held their job for 6.03
years (SD = 5.70). Individuals received 0.20 USD for participating.3
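
As a rough check on the a priori power analysis described above, the required sample size can be approximated in a few lines of Python. This is a minimal sketch and not the authors' procedure: it assumes a two-sided test of a single regression coefficient at alpha = .05 and replaces the exact noncentral distribution with a normal approximation, so its answer may differ by a few participants from the reported value. The function names are illustrative.

```python
from statistics import NormalDist


def approx_power(n, f2, alpha=0.05):
    """Approximate power for a two-sided test of one regression coefficient.

    Normal approximation: power ~ Phi(sqrt(f2 * n) - z_crit).
    An exact calculation would use a noncentral F (or t) distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = (f2 * n) ** 0.5
    return z.cdf(noncentrality - z_crit)


def smallest_n(f2, target_power=0.80, alpha=0.05):
    """Smallest sample size whose approximate power reaches the target."""
    n = 2
    while approx_power(n, f2, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    print("approximate power at N = 395:", round(approx_power(395, 0.02), 3))
    print("approximate required N:", smallest_n(0.02))
```

Under these assumptions the approximation should land close to the reported target of 395; small discrepancies are expected because the exact test statistic has heavier tails than the normal distribution.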


Procedure.  Participants  stated  their  current  job  and  reported 
their  work-based  mastery  goals  and  reasons  for  goal  pursuit.  The 
goal  and  reason  variables  were  counterbalanced:  249  participants 
completed  the  reason  items  first,  211  completed  the  goal  items 
first.  Then,  job  interest,  satisfaction,  and  positive  emotion  were 
assessed. 

Measures.  Table  2  presents  the  descriptive  statistics  and  cor¬ 
relation  matrix.  Participants  responded  using  a  1  =  not  at  all,  4  = 
somewhat,  7  =  completely  scale. 

Mastery  goals.  Elliot  and  Murayama’s  (2008)  Achievement 
Goal  Questionnaire — Revised  (AGQ-R)  was  adapted  to  assess 
work-based  mastery  goals.  The  three  items  were  presented  as 
“descriptions  of  how  [one]  might  pursue  goals  at  [his/her]  job” 
(e.g.,  “In  my  job,  my  goal  is  to  learn  as  much  as  possible”). 

Autonomous  and  controlled  reasons  for  goal  pursuit. 
Michou et al.'s (2014) measure was adapted to assess work-based
autonomous  and  controlled  reasons  for  goal  pursuit.  To  disentan¬ 
gle  the  goal  component  from  the  reason  component,  we  adjusted 
these  items  so  that  they  did  not  refer  to  a  specific  achievement 
goal.  The  items  were  presented  as  “explanations  for  why  [one] 
might  pursue  goals  at  [his/her]  job.”  Two  items  assessed  autono¬ 
mous  reasons  (e.g.,  “In  my  job,  I  pursue  goals  because  I  find  them 
highly  stimulating  and  challenging”)  and  four  items  assessed  con¬ 
trolled  reasons  (e.g.,  “In  my  job,  I  pursue  goals  because  others  will 
reward  me  only  if  I  achieve  these  goals”). 

Job  interest.  Ryan’s  (1982)  six-item  Intrinsic  Motivation  In¬ 
ventory  was  adapted  to  assess  job  interest  (e.g.,  “I  would  describe 
my  work  as  very  interesting”). 

Job  satisfaction.  Diener,  Emmons,  Larsen,  and  Griffin’s 
(1985)  five-item  Satisfaction  with  Life  Scale  was  adapted  to  assess 
job  satisfaction  (e.g.,  “I  am  satisfied  with  my  work”). 

Job  positive  emotion.  Watson,  Clark,  and  Tellegen’s  (1988) 
Positive  and  Negative  Affect  Schedule  was  adapted  to  assess  job 
positive emotion. Participants were asked to indicate the extent to
which they feel 10 positive emotions in their work (e.g., “excited,”
“proud”). 

Results 

Overview.  We  used  sequential  linear  regression  for  our  anal¬ 
yses.  For  each  outcome  variable,  three  models  were  built.  First,  in 
the  “goal-only”  model,  only  mastery  goals  were  included  as  a 
predictor  (Model  1  in  Table  3).  Second,  in  the  “reason-only” 
model,  only  autonomous  and  controlled  reasons  were  included  as 
predictors  (Model  2  in  Table  3).  Third,  in  the  “goal-and-reason” 
model,  mastery  goals  and  autonomous  and  controlled  reasons  were 
included  as  predictors  (Model  3  in  Table  3).  This  enabled  us  to 
estimate  the  independent  contribution  of  the  two  focal  variables — 
mastery  goals  and  autonomous  reasons — as  well  as  the  reduction 
of  their  predictive  strength  when  partialing  out  the  variance  ac¬ 
counted  for  by  the  other  variable. 
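
The three-model strategy can be illustrated with simulated data. The sketch below is hypothetical: the variable names, the simulated effect sizes, and the small pure-Python least-squares helper are assumptions made for illustration and are not taken from the study's syntax files.

```python
import random


def ols(predictors, outcome):
    """Return [intercept, b1, b2, ...] from ordinary least squares."""
    n = len(outcome)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y).
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, outcome))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Simulated (not real) data: goals and autonomous reasons are correlated,
# and both contribute to the outcome.
random.seed(1)
n = 2000
goal = [random.gauss(0, 1) for _ in range(n)]
autonomous = [0.6 * g + random.gauss(0, 0.8) for g in goal]
controlled = [random.gauss(0, 1) for _ in range(n)]
interest = [0.3 * g + 0.5 * a + random.gauss(0, 1)
            for g, a in zip(goal, autonomous)]

model1 = ols([goal], interest)                          # "goal-only"
model2 = ols([autonomous, controlled], interest)        # "reason-only"
model3 = ols([goal, autonomous, controlled], interest)  # "goal-and-reason"

if __name__ == "__main__":
    print("goal slope alone vs. joint:", model1[1], model3[1])
    print("autonomous slope alone vs. joint:", model2[1], model3[2])
```

In this simulated setup, each focal slope shrinks when the other predictor is added, which is the kind of reduction the goal-and-reason model is designed to quantify.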

Preliminary  analysis.  We  conducted  a  preliminary  analysis 
to  examine  potential  covariates:  sex  (“1”  =  male,  “2”  =  female, 
for  all  studies),  age,  and  seniority.  In  addition,  we  tested  the 


3  For this and the subsequent studies, the payment was well above
the  reservation  wage  of  $1.38  per  hour  (i.e.,  the  minimum  wage  a  worker 
is  willing  to  accept  to  complete  a  task;  Horton  &  Chilton,  2010).  Payment 
level  has  been  found  not  to  affect  data  quality  (Buhrmester  et  al.,  2011). 
Table 1

Summary of the Hypotheses, Their Rationale, Their Operationalized Predictors, and the Studies and Outcomes to Which They Relate

| Hypotheses | Rationale | Predictors and “operationalization” | Studies: Types of outcome |
| --- | --- | --- | --- |
| H1a. Mastery goals are a positive predictor of beneficial outcomes | Replication of prior research | Mastery goals alone: “My goal is to learn” | S1-2: Experiential; S3-4: Self-regulated learning; S4: Extended to performance goals |
| H1b. Autonomous reasons are a positive predictor of beneficial outcomes | Replication of prior research | Autonomous reasons alone: “I pursue goals because I find them challenging” | S1-2: Experiential; S3-4: Self-regulated learning |
| H2a-b. Mastery goals (H2a) and autonomous reasons (H2b) explain independent variance in beneficial outcomes | Mastery goals and autonomous reasons differ | Mastery goals plus autonomous reasons | S1-2: Experiential; S3-4: Self-regulated learning; S4: Extended to performance goals |
| H3a-b. The influence of mastery goals is diminished when controlling for autonomous reasons (H3a), and vice versa (H3b) | Mastery goals and autonomous reasons overlap | Mastery goals plus autonomous reasons | S1-2: Experiential; S3-4: Self-regulated learning |
| H4. The autonomous mastery goal complex explains incremental variance in beneficial outcomes | The autonomous mastery goal complex is more than the mere sum of goal and reason | Mastery goals plus autonomous reasons plus autonomous mastery goal complex: “My goal is to learn because I find this a highly challenging goal” | S2: Experiential; S3-4: Self-regulated learning; S4: Extended to performance goals |
| H5a-b. The influence of mastery goals (H5a) and autonomous reasons (H5b) is diminished when controlling for the autonomous mastery goal complex | Measurement redundancy | Mastery goals plus autonomous reasons plus autonomous mastery goal complex: “My goal is to learn because I find this a highly challenging goal” | S2: Experiential; S3-4: Self-regulated learning; S4: Extended to performance goals |


interactions  between  order  (“1”  =  reasons  first,  “2”  =  goals 
first,  for  all  studies)  and  our  predictor  variables  (i.e.,  mastery 
goals  and  autonomous  and  controlled  reasons;  see  Yzerbyt, 
Muller, & Judd, 2004). None of the covariates attained significance
(ps > .088), and neither order main nor interactive
effects were observed (ps ≥ .152). Hence these terms were not
considered  further  (including  them  did  not  change  the  pattern  of 
results). 

Main  analyses.  For  this  and  all  subsequent  studies,  our  report 
of  the  results  is  hypothesis  driven.  Nontheoretically  relevant  find¬ 
ings  are  not  reported  in  the  narrative,  but  are  included  in  Table  3 
(which  presents  the  full  set  of  results).  Effect  size  estimates  are 
also included in the tables. These estimates are partial eta squared
(ηp²), that is, the proportion of variance uniquely explained by a


predictor  (i.e.,  while  partialing  out  the  effect  of  the  other  predic¬ 
tors). 
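
For a single predictor in a linear regression, partial eta squared can be recovered from that predictor's t statistic and the residual degrees of freedom, because SS_effect / (SS_effect + SS_error) equals t² / (t² + df_residual) for a one-degree-of-freedom effect. The sketch below uses made-up numbers; the function name is illustrative.

```python
def partial_eta_squared(t_value, df_residual):
    """Partial eta squared for a one-df effect: t^2 / (t^2 + df_residual)."""
    t_squared = t_value ** 2
    return t_squared / (t_squared + df_residual)


if __name__ == "__main__":
    # Hypothetical example: t = 2.0 with 96 residual degrees of freedom.
    print(partial_eta_squared(2.0, 96))
```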

“Goal-only” model. In line with Hypothesis 1a, mastery
goals were a positive predictor of interest, B = 0.62 [0.53, 0.71],
p < .001, satisfaction, B = 0.52 [0.42, 0.63], p < .001, and
positive emotion, B = 0.57 [0.49, 0.67], p < .001 (numbers in
brackets represent 95% confidence intervals).

“Reason-only” model. In line with Hypothesis 1b, autonomous
reasons were a positive predictor of interest, B = 0.66 [0.59,
0.73],  p  <  .001,  satisfaction,  B  =  0.62  [0.54,  0.70],  p  <  .001,  and 
positive  emotion,  B  =  0.58  [0.51,  0.64],  p  <  .001. 

“Goal-and-reason”  model.  In  line  with  Hypothesis  2a,  mas¬ 
tery  goals  remained  a  positive  predictor  of  interest,  B  =  0.26  [0.16, 
0.36],  p  <  .001,  and  positive  emotion,  B  =  0.20  [0.10,  0.30],  p  < 


Table 2

Studies 1 and 2: Descriptive Statistics and Correlation Matrix for the Main Variables

Descriptive statistics (α, M, SD) are given as Study 1/Study 2. In the correlation matrix, Study 1 is below the diagonal and Study 2 is above the diagonal.

| Variable | α | M | SD | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mastery goals (1) | .87/.84 | 5.84/5.85 | 1.13/1.05 | — | .65*** | .32*** | .73*** | .47*** | .58*** | .49*** | .58*** |
| Autonomous reasons (2) | .86/.80 | 5.33/5.51 | 1.38/1.21 | .60*** | — | .28*** | .81*** | .37*** | .64*** | .67*** | .67*** |
| Controlled reasons (3) | .65/.70 | 4.85/4.96 | 1.14/1.19 | .28*** (?) | [illegible] | — | .30*** | .83*** | .07 | [illegible] | [illegible] |
| Autonomous mastery goal complex (4) | n/a/.91 | n/a/5.48 | n/a/1.11 | n/a | n/a | n/a | — | .42*** | .62*** | .60*** | .66*** |
| Controlled mastery goal complex (5) | n/a/.91 | n/a/5.05 | n/a/1.13 | n/a | n/a | n/a | n/a | — | [illegible] | .36*** | .38*** |
| Job interest (6) | .88/.84 | 5.02/5.07 | 1.31/1.22 | [illegible] | .68*** | .11* | n/a | n/a | — | [illegible] | .68*** |
| Job satisfaction (7) | .91/.89 | 4.91/5.12 | 1.43/1.33 | .41*** | .61*** | [illegible] | n/a | n/a | .74*** | — | .71*** |
| Job positive emotion (8) | .94/.94 | 5.32/5.54 | 1.26/1.16 | .52*** | .66*** | .26*** | n/a | n/a | .76*** (?) | [illegible] | — |

Note. n/a = not applicable (i.e., the variable was not measured in the study).
* p < .05. *** p < .001.

Table  3 
.001; contrary to the hypothesis, mastery goals no longer predicted
satisfaction, B = 0.09 [-0.02, 0.21], p = .117. In line with
Hypothesis 2b, autonomous reasons remained a positive predictor
of interest, B = 0.54 [0.46, 0.62], p < .001, satisfaction, B = 0.58
[0.48, 0.67], p < .001, and positive emotion, B = 0.49 [0.41, 0.56],
p < .001.

In this and the subsequent studies, we used the Monte Carlo
method (with 50,000 simulations) to estimate the confidence intervals
for the reduction of the predictive strength of mastery goals
when controlling for autonomous reasons, and vice versa (MacKinnon,
Lockwood, & Williams, 2004). In addition, percentage
reductions in the effect and Sobel tests (Z tests and p values) are
reported in parentheses. In line with Hypothesis 3a, the reductions of
the relations between mastery goals and interest, B = 0.38 [0.31,
0.45] (59% reduction), satisfaction, B = 0.40 [0.32, 0.42] (81%),
and positive emotion, B = 0.34 [0.27, 0.41] (63%), due to the
inclusion of autonomous reasons were significant (Zs ≥ 9.30, ps <
.001). In line with Hypothesis 3b, the reductions of the relations
between autonomous reasons and interest, B = 0.12 [0.07, 0.17]
(18%), and positive emotion, B = 0.09 [0.05, 0.14] (16%), due to
the inclusion of mastery goals were significant (Zs ≥ 3.96, ps <
.001); contrary to the hypothesis, the reduction of the relation
between autonomous reasons and satisfaction, B = 0.04 [-0.01,
0.10] (7%), was not significant (Z = 1.56, p = .118).
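
The logic of these tests can be sketched as follows. In ordinary least squares with two predictors, the drop in one predictor's coefficient when the other is added equals the product a × b, where a is the slope linking the first predictor to the added one and b is the added predictor's slope on the outcome in the joint model. The Sobel test divides this product by its approximate standard error, and the Monte Carlo method simulates the product's sampling distribution. The code is a minimal illustration with hypothetical estimates; it is not the authors' syntax, and the function names and input values are assumptions.

```python
import random
from statistics import NormalDist


def sobel_z(a, se_a, b, se_b):
    """Sobel Z statistic for the product a * b."""
    return (a * b) / ((a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2) ** 0.5)


def two_sided_p(z_value):
    """Two-sided p value for a standard normal Z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z_value)))


def monte_carlo_ci(a, se_a, b, se_b, draws=50_000, level=0.95, seed=0):
    """Percentile confidence interval for a * b from simulated draws."""
    rng = random.Random(seed)
    products = sorted(
        rng.gauss(a, se_a) * rng.gauss(b, se_b) for _ in range(draws)
    )
    lower_index = int(((1 - level) / 2) * draws)
    upper_index = int((1 - (1 - level) / 2) * draws) - 1
    return products[lower_index], products[upper_index]


if __name__ == "__main__":
    # Hypothetical estimates and standard errors, for illustration only.
    a, se_a, b, se_b = 0.5, 0.1, 0.4, 0.1
    z = sobel_z(a, se_a, b, se_b)
    print("reduction (a * b):", a * b)
    print("Sobel Z:", round(z, 3), "p:", two_sided_p(z))
    print("Monte Carlo 95% CI:", monte_carlo_ci(a, se_a, b, se_b))
```

A confidence interval that excludes zero points to a reliable reduction. The percentile interval is often preferred over the Sobel test because the product of two normally distributed estimates is not itself normally distributed.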

Discussion 

Mastery goals (Hypothesis 1a) and autonomous reasons (Hypothesis
1b) accounted for variance in interest, satisfaction, and
positive emotion when tested separately. More importantly, mastery
goals (Hypothesis 2a) and autonomous reasons (Hypothesis
2b) each explained independent variance in interest and positive
emotion when tested simultaneously. Moreover, the predictive
strength of mastery goals (Hypothesis 3a) and autonomous reasons
(Hypothesis 3b) for interest and positive emotion was diminished
when taking the other into account. This suggests that neither
construct “captured” all of the variance explained by the other:
Mastery goals and autonomous reasons shared predictive utility
with regard to these outcomes, but their overlap was not so
substantial as to conclude that one eliminates the influence of the
other. For satisfaction, however, Hypotheses 2a and 3b were not
supported. Mastery goals no longer explained a significant portion
of variance in satisfaction when autonomous reasons were controlled,
and controlling for mastery goals did not significantly
diminish the influence of autonomous reasons. This suggests that
for at least some outcomes, the influence of reasons may indeed
outweigh the influence of goals.

One  important  issue  that  Study  1  left  unaddressed  is  the  auton¬ 
omous  mastery  goal  complex.  Prior  goal  complex  research  has 
shown  (from  our  perspective)  that  controlling  for  the  autonomous 
mastery  goal  complex  leads  to  a  decrease  in  the  predictive  strength 
of  mastery  goals;  however,  it  has  not  tested  for  a  parallel  decrease 
in  the  predictive  strength  of  autonomous  reasons.  In  Study  2,  we 
unambiguously  separate  achievement  goals,  reasons,  and  achieve¬ 
ment  goal  complexes  in  order  to  test  whether  the  autonomous 
mastery  goal  complex  explains  incremental  variance  in  interest, 
satisfaction,  and  positive  emotion,  and  whether  it  diminishes  the 
predictive  strength  of  both  mastery  goals  and  autonomous  reasons. 


Study  2.  Mastery  Goals,  Reasons,  Goal  Complexes, 
and  Experiential  Outcomes 

Study  2  was  designed  to  test  mastery  goals,  SDT-derived  rea¬ 
sons,  and  achievement  goal  complexes  as  predictors  of  the  same 
experiential outcomes used in Study 1. Participants reported their
work-based  mastery  goals,  their  autonomous  and  controlled  rea¬ 
sons  for  goal  pursuit,  and  their  autonomous  and  controlled  mastery 
goal  complexes.  Participants  also  reported  their  job  interest,  satis¬ 
faction,  and  positive  emotion. 

Method 

Participants. The target sample size was the same as in Study
1. To participate, MTurk workers had to currently have a job and
not have participated in Study 1. A total of 407 participants
completed the questionnaire; one was excluded a priori due to
missing data on the outcome variables. The final sample consisted
of 406 U.S. residents, 236 men and 170 women, with a mean age
of 33.18 (SD = 10.07), and having held their job for 6.36 years
(SD = 5.87). Individuals received 0.20 USD for participating.

Procedure.  Participants  stated  their  current  job  and  reported 
their  work-based  mastery  goals,  reasons,  and  goal  complexes.  As 
in  Study  1,  the  goal  and  reason  variables  were  counterbalanced: 
206  participants  completed  the  reason  items  first,  200  completed 
the  goal  items  first.  Then,  job  interest,  satisfaction,  and  positive 
emotion  were  assessed. 

Measures. Table 2 presents the descriptive statistics and
correlation matrix. Participants responded using a 1 = not at all, 4 =
somewhat, 7 = completely scale.

Mastery  goals.  The  same  measure  used  in  the  prior  study  was 
used  in  this  study. 

Autonomous and controlled reasons for goal pursuit. The
same measure used in the prior study was used in this study.

Autonomous and controlled mastery goal complexes. Each
of the three items measuring mastery goals was combined with
each of the six items measuring autonomous and controlled
reasons to assess work-based autonomous and controlled mastery
goal complexes. The statements thus produced were presented as
“descriptions of how you might pursue goals at your job, together
with explanations for why you might pursue them.” Six items (3
goal items × 2 reason items) assessed the autonomous mastery
goal complex (e.g., “In my job, my goal is to learn as much as
possible because I find this a highly stimulating and challenging
goal”), and 12 items (3 goal items × 4 reason items) assessed the
controlled mastery goal complex (e.g., “In my job, my goal is to
learn as much as possible because others will reward me only if I
achieve this goal”).
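
The item-crossing scheme described above can be illustrated with a short sketch. It is a hypothetical illustration only, not the authors' materials: the placeholder labels (`goal_1`, `autonomous_reason_1`, and so on) and the helper `cross_items` are assumptions, and only the item counts (3 goal items, 2 autonomous reason items, 4 controlled reason items) come from the text.

```python
from itertools import product

# Placeholder labels (assumed); the actual item wordings are not reproduced here.
goal_items = ["goal_1", "goal_2", "goal_3"]
autonomous_reasons = ["autonomous_reason_1", "autonomous_reason_2"]
controlled_reasons = [f"controlled_reason_{i}" for i in range(1, 5)]


def cross_items(goals, reasons):
    """Pair every goal item with every reason item ("<goal> because <reason>")."""
    return [f"{goal} because {reason}" for goal, reason in product(goals, reasons)]


autonomous_complex = cross_items(goal_items, autonomous_reasons)
controlled_complex = cross_items(goal_items, controlled_reasons)

print(len(autonomous_complex))  # 3 goal items x 2 reason items
print(len(controlled_complex))  # 3 goal items x 4 reason items
```

Crossing every goal statement with every reason statement is what yields the 6-item and 12-item complex scales mentioned in the text.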

Job  interest,  satisfaction,  and  positive  emotion.  Job  interest, 
satisfaction,  and  positive  emotion  were  assessed  using  the  same 
measures  used  in  Study  1. 

Results 

Overview.  We  used  the  same  analytical  strategy  as  in  Study  1, 
albeit  with  a  fourth  step  added  to  test  the  “goal  complex”  model. 
In  this  model,  mastery  goals,  autonomous  and  controlled  reasons, 
and  autonomous  and  controlled  mastery  goal  complexes  were 
included  as  predictors  (Model  4  in  Table  3).  This  enabled  us  to 
estimate the incremental contribution of the autonomous mastery
goal  complex,  as  well  as  the  reduction  of  the  predictive  strength  of 
mastery  goals  and  autonomous  reasons  when  controlling  for  this 
goal  complex.4 

Preliminary analysis. As in Study 1, we conducted a preliminary
analysis to examine potential covariates (sex, age, seniority)
and order effects. None of the covariates attained significance
(ps > .061), except for a positive association between seniority and
interest, B = 0.02 [0, 0.04], p = .025. Although no order main
effects were observed (ps > .634), order interacted with mastery
goals in predicting interest, B = -0.26 [-0.49, -0.04], p = .021,
and with autonomous reasons in predicting interest, B = 0.23
[0.03, 0.42], p = .021, and positive emotion, B = 0.19 [0.01, 0.37],
p = .042. Because including these terms was not theoretically
relevant and did not change the pattern of results, they were not
considered further.

Main  analyses.  Table  3  presents  the  full  set  of  results. 

“Goal-only” model. In line with Hypothesis 1a, mastery
goals were a positive predictor of interest, B = 0.67 [0.58, 0.77],
p < .001; satisfaction, B = 0.62 [0.51, 0.73], p < .001; and
positive emotion, B = 0.65 [0.56, 0.73], p < .001.

“Reason-only” model. In line with Hypothesis 1b, autonomous
reasons were a positive predictor of interest, B = 0.68 [0.60,
0.76], p < .001; satisfaction, B = 0.70 [0.62, 0.79], p < .001; and
positive emotion, B = 0.61 [0.54, 0.68], p < .001.

“Goal-and-reason” model. In line with Hypothesis 2a, mastery
goals remained a positive predictor of interest, B = 0.37 [0.26,
0.48], p < .001, and positive emotion, B = 0.27 [0.16, 0.37], p <
.001; contrary to the hypothesis, mastery goals no longer predicted
satisfaction, B = 0.08 [-0.04, 0.20], p = .195. In line with
Hypothesis 2b, autonomous reasons remained a positive predictor
of interest, B = 0.49 [0.39, 0.58], p < .001; satisfaction, B = 0.66
[0.55, 0.77], p < .001; and positive emotion, B = 0.47 [0.38, 0.56],
p < .001.

In line with Hypothesis 3a, the Monte Carlo method revealed
that the reductions in the relations between mastery goals and
interest, B = 0.35 [0.28, 0.44] (49% reduction), satisfaction, B =
0.48 [0.39, 0.58] (86%), and positive emotion, B = 0.34 [0.27,
0.42] (56%), due to the inclusion of autonomous reasons were
significant (Zs > 8.54, ps < .001). In line with Hypothesis 3b, the
reductions in the relations between autonomous reasons and both
interest, B = 0.19 [0.13, 0.26] (29%), and positive emotion, B =
0.14 [0.08, 0.20] (23%), due to the inclusion of mastery goals were
significant (Zs > 4.75, ps < .001); contrary to the hypothesis, the
reduction in the relation between autonomous reasons and
satisfaction, B = 0.04 [-0.02, 0.11] (6%), was not significant (Z =
1.29, p = .196).
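
The percentages in parentheses appear consistent with expressing each reduction as a share of the predictor's total association, that is, reduction / (reduction + coefficient remaining in the “goal-and-reason” model). This reading is inferred from the reported numbers rather than stated in the text, and the helper name `percent_reduction` is hypothetical. A minimal check using the mastery goal values reported above:

```python
def percent_reduction(reduction, remaining):
    """Share (in %) of the total association (reduction + remaining) that is removed."""
    return 100 * reduction / (reduction + remaining)


# (reduction in B, B remaining in the "goal-and-reason" model) for mastery goals in Study 2
reported = {
    "interest": (0.35, 0.37),
    "satisfaction": (0.48, 0.08),
    "positive emotion": (0.34, 0.27),
}

for outcome, (reduction, remaining) in reported.items():
    print(outcome, round(percent_reduction(reduction, remaining)))
```

Because the published coefficients are rounded to two decimals, the same calculation can be off by a percentage point for some of the other entries.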

“Goal complex” model. In line with Hypothesis 4, the autonomous
mastery goal complex was a positive predictor of interest,
B = 0.18 [0.03, 0.33], p = .015; satisfaction, B = 0.18 [0.02,
0.34], p = .031; and positive emotion, B = 0.24 [0.10, 0.38], p <
.001.

Again, we used the Monte Carlo method to estimate the reduction
of the predictive strength of mastery goals and autonomous
reasons when controlling for the autonomous mastery goal complex.
In line with Hypothesis 5a, the reductions in the relations
between mastery goals and both interest, B = 0.06 [0.01, 0.11]
(18%), and positive emotion, B = 0.08 [0.03, 0.13] (34%), due to
the inclusion of the autonomous mastery goal complex were
significant (Zs > 2.34, ps < .019; mastery goals remained a
significant predictor in both instances, ps ≤ .01). The analysis was not
conducted for satisfaction, given the null relation for mastery goals
in the “goal-and-reason” model. In line with Hypothesis 5b, the
reductions in the relations between autonomous reasons and
interest, B = 0.10 [0.02, 0.17] (20%), satisfaction, B = 0.09 [0.01,
0.18] (14%), and positive emotion, B = 0.13 [0.05, 0.20] (27%),
due to the inclusion of the autonomous mastery goal complex were
significant (Zs > 2.14, ps < .032; autonomous reasons remained
a significant predictor in all instances, ps < .001).
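
The text does not spell out how the Monte Carlo method was implemented. One common form, sketched below under stated assumptions, treats the reduction as a product of two regression coefficients, draws each coefficient from a normal distribution defined by its estimate and standard error, and reads a confidence interval off the simulated products. The function name `monte_carlo_ci`, the independence of the two draws, and all input values are illustrative assumptions, not figures or procedures taken from the study.

```python
import random


def monte_carlo_ci(a, se_a, b, se_b, draws=20000, level=0.95, seed=1):
    """Percentile interval for a*b, drawing a and b from independent normals."""
    rng = random.Random(seed)
    products = sorted(rng.gauss(a, se_a) * rng.gauss(b, se_b) for _ in range(draws))
    tail = (1 - level) / 2
    lower = products[int(tail * draws)]
    upper = products[int((1 - tail) * draws) - 1]
    return lower, upper


# Illustrative values only (assumed, not from the study).
a_hat, se_a = 0.60, 0.04  # path from the focal predictor to the added predictor
b_hat, se_b = 0.50, 0.05  # path from the added predictor to the outcome
low, high = monte_carlo_ci(a_hat, se_a, b_hat, se_b)
print(f"estimate = {a_hat * b_hat:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

Under this approach, an interval that excludes zero would indicate that the reduction in the coefficient is statistically reliable.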

Discussion 

Replicating Study 1’s findings, mastery goals and autonomous
reasons accounted for variance in interest, satisfaction, and
positive emotion when tested separately, and also explained
independent variance in interest and positive emotion when controlling for
the other variable (with the predictive strength of each being
diminished). This suggests that mastery goals and autonomous
reasons overlap without canceling one another. However, as in
Study 1, satisfaction was more robustly predicted by autonomous
reasons than by mastery goals.

Extending Study 1’s findings, the autonomous mastery goal
complex explained incremental variance in interest, satisfaction,
and positive emotion (Hypothesis 4). Thus, mastery goals and
autonomous reasons not only have an independent influence on
adaptive outcomes, they also fuse together in the form of a goal
complex that has additional predictive benefits. Moreover, the
predictive strength of both mastery goals (Hypothesis 5a) and
autonomous reasons (Hypothesis 5b) was diminished when controlling
for the autonomous mastery goal complex. In line with Gillet et al.’s
(2015) findings (from our perspective), controlling for the
autonomous mastery goal complex diminishes the predictive strength of
mastery goals per se; however, it also diminishes the predictive
strength of autonomous reasons per se.

The  effect  sizes  for  mastery  goals  were  descriptively  smaller 
than  those  for  autonomous  reasons.  One  possible  reason  for  this  is 
the  nature  of  the  outcome  variables  used  in  the  first  two  studies. 
Building  on  existing  research,  we  used  experiential  outcomes, 
which  may  be  particularly  sensitive  to  feelings  of  task  autonomy 
(Ryan  &  Deci,  2006).  In  Study  3,  we  switched  to  self-regulated 
learning  outcomes,  which  may  be  equally  sensitive  to  mastery 
goals  and  autonomous  reasons  (see  Dysvik  &  Kuvaas,  2013). 
Specifically,  in  Study  3  we  tested  the  same  set  of  five  hypotheses 
with the following self-regulated learning outcomes: deep learning,
interpersonal help-seeking behavior, and challenging tasks.

Study  3.  Mastery  Goals,  Reasons,  Goal  Complexes, 
and  Self-Regulated  Learning 

Study 3 was designed to test mastery goals, SDT-derived reasons,
and achievement goal complexes as predictors of three
self-regulated learning outcomes. Participants reported their
work-based mastery goals, their autonomous and controlled reasons for
goal pursuit, and their autonomous and controlled mastery goal
complexes. They also reported their job deep learning,
help-seeking, and challenging tasks.

4 Vansteenkiste, Smeets, et al. (2010) noted that variables connecting
autonomous or controlled reasons to a given achievement goal could seem
odd for a participant not pursuing this achievement goal. Accordingly, we
repeated the analyses for the full study, excluding the two participants with
an average mastery goal score below 2 (3 in Study 3; 6 in Study 4). The
results for the achievement goal complex variables remained essentially the
same as those reported in the text (this is the case for all studies).

Method 

Participants. The target sample size was the same as in the
prior studies. To participate, MTurk workers had to currently have
a job and not have participated in Studies 1 or 2. A total of 440
participants completed the questionnaire; 11 were excluded a
priori due to missing data on the outcome variables. The final
sample consisted of 429 U.S. residents, 213 men and 216 women,
with a mean age of 34.19 (SD = 10.07), who had held their job
for 6.23 years on average (SD = 6.64). Individuals received 0.30 USD for
participating.

Procedure. Participants stated their current job and reported
their work-based mastery goals, reasons, and goal complexes.
Again, the goal and reason variables were counterbalanced: 211
participants completed the reason items first, 218 completed the
goal items first. Then, job deep learning, help-seeking, and
challenging tasks were assessed.

Measures. Table 4 presents the descriptive statistics and
correlation matrix. Participants responded using a 1 = not at all, 4 =
somewhat, 7 = completely scale.

Mastery goals. The same measure used in the prior study was
used in this study.

Autonomous and controlled reasons for goal pursuit. The
same measure used in the prior study was used in this study.

Autonomous  and  controlled  mastery  goal  complexes.  The 
same  measure  used  in  the  prior  study  was  used  in  this  study. 

Job deep learning. Kirby, Knapper, Evans, Carty, and Gadula’s
(2003) 10-item deep subscale from the Approaches to Learning
at Work Questionnaire assessed job deep learning (e.g., “I
spend a good deal of my spare time learning about things related
to my work”).

Job help-seeking. Holman, Epitropaki, and Fernie’s (2001)
three-item interpersonal help-seeking subscale from the Scale of
Learning Strategies in the Workplace assessed job help-seeking
(e.g., “I ask others for more information when I need it [at my
work]”).

Job challenging tasks. Preenen, De Pater, Van Vianen, and
Keijzer’s (2011) six-item Challenging Assignments Scale was
adapted to assess job challenging tasks (e.g., “[In my work I
perform tasks] that are challenging”).

Results 

Overview.  We  used  the  same  analytical  strategy  used  in  Study 
2.  For  each  outcome  variable,  four  linear  regression  models  were 
built  (see  Models  1  to  4  in  Table  5). 

Preliminary analysis. As in Studies 1 and 2, we conducted a
preliminary analysis to examine potential covariates (sex, age,
seniority) and order effects. None of the covariates attained
significance (ps > .083), except for a negative association between
age and deep learning, B = -0.02 [-0.02, -0.01], p < .001, and
a positive association between sex and help-seeking, B = 0.20
[0.01, 0.38], p < .001. An order main effect was observed on
help-seeking, B = 0.20 [0.01, 0.40], p = .043, as well as an
interactive effect with autonomous reasons on deep learning,
B = -0.13 [-0.25, -0.02], p = .022. Because including these terms
was not theoretically relevant and did not change the pattern of
results, they were not considered further.

Main  analyses.  Table  5  presents  the  full  set  of  results. 

“Goal-only” model. In line with Hypothesis 1a, mastery
goals were a positive predictor of deep learning, B = 0.50 [0.43,
0.58], p < .001; help-seeking, B = 0.38 [0.30, 0.46], p < .001;
and challenging tasks, B = 0.50 [0.42, 0.58], p < .001.

“Reason-only” model. In line with Hypothesis 1b, autonomous
reasons were a positive predictor of deep learning, B = 0.42
[0.37, 0.47], p < .001; help-seeking, B = 0.16 [0.09, 0.22], p <
.001; and challenging tasks, B = 0.37 [0.32, 0.43], p < .001.

“Goal-and-reason” model. In line with Hypothesis 2a, mastery
goals remained a positive predictor of deep learning, B = 0.26
[0.18, 0.34], p < .001; help-seeking, B = 0.36 [0.26, 0.46], p <
.001; and challenging tasks, B = 0.28 [0.19, 0.37], p < .001. In
line with Hypothesis 2b, autonomous reasons remained a positive
predictor of deep learning, B = 0.32 [0.26, 0.38], p < .001, and
challenging tasks, B = 0.27 [0.20, 0.33], p < .001; contrary to the
hypothesis, these reasons no longer predicted help-seeking, B =
0.02 [-0.05, 0.09], p = .560.

In line with Hypothesis 3a, the Monte Carlo method revealed
that the reductions in the relations between mastery goals and both
deep learning, B = 0.23 [0.18, 0.28] (46% reduction), and
challenging tasks, B = 0.19 [0.14, 0.25] (41%), due to the inclusion of
autonomous reasons were significant (Zs > 6.82, ps < .001);
contrary to the hypothesis, the reduction in the relation between


Table 4

Study 3: Descriptive Statistics and Correlation Matrix for the Main Variables

Variable                                  α     M      SD    Correlations with variables (1), (2), ... in order
(1) Mastery goals                         .88   5.89   1.18  –
(2) Autonomous reasons                    .87   5.02   1.56  [value missing]
(3) Controlled reasons                    .66   4.67   1.24  .30***, [illegible]
(4) Autonomous mastery goal complex       .91   5.22   1.44  .64***, .82***, .16***
(5) Controlled mastery goal complex       .95   4.68   1.23  .32***, [illegible], [illegible], .24***
(6) Job deep learning strategy            .87   4.90   1.08  .55***, .62***, .22***, [illegible] [one of five values missing]
(7) Job interpersonal help-seeking        .88   5.91   1.09  .42***, .25***, .16***, .31***, [illegible], [illegible]
(8) Job challenging tasks                 .85   5.50   1.13  [illegible], .54***, .25***, .57***, .28***, .57***, .42***

*** p < .001.


Table  5 

Study 3: Coefficient Estimates and Effect Sizes for the Models Testing the Influence of Mastery Goals Alone (Model 1; “Goal-Only” Model), Autonomous and Controlled
Reasons Alone (Model 2; “Reason-Only” Model), Mastery Goals and Reasons (Model 3; “Goal-and-Reason” Model), and Mastery Goals, Reasons, and Mastery Goal
Complexes (Model 4; “Goal Complex” Model)
mastery goals and help-seeking, B = 0.02 [-0.04, 0.07] (4%), was
not significant (Z < 1, p = .560). In line with Hypothesis 3b, the
reductions in the relations between autonomous reasons and deep
learning, B = 0.10 [0.07, 0.14] (24%); help-seeking, B = 0.14
[0.10, 0.18] (87%); and challenging tasks, B = 0.11 [0.07, 0.14]
(28%), due to the inclusion of mastery goals were significant (Zs >
5.52, ps < .001).

“Goal complex” model. In line with Hypothesis 4, the autonomous
mastery goal complex was a positive predictor of deep
learning, B = 0.34 [0.24, 0.43], p < .001, and challenging tasks,
B = 0.18 [0.07, 0.30], p = .001; contrary to the hypothesis, the
autonomous mastery goal complex did not predict help-seeking,
B = 0.08 [-0.04, 0.21], p = .205.

In line with Hypothesis 5a, the Monte Carlo method revealed
that the reductions in the relations between mastery goals and both
deep learning, B = 0.11 [0.07, 0.15] (45%), and challenging tasks,
B = 0.06 [0.02, 0.10] (23%), due to the inclusion of the autonomous
mastery goal complex were significant (Zs > 3.01, ps <
.003; mastery goals remained a significant predictor in both
instances, ps < .001). In line with Hypothesis 5b, the reductions in
the relations between autonomous reasons and both deep learning,
B = 0.21 [0.15, 0.27] (67%), and challenging tasks, B = 0.11
[0.04, 0.18] (43%), due to the inclusion of the autonomous mastery
goal complex were significant (Zs > 3.17, ps < .002; autonomous
reasons remained a significant predictor in both instances, ps <
.011). The analysis was not conducted for help-seeking, given the
null relation for the autonomous mastery goal complex.

Discussion 

Consistent with Studies 1 and 2, mastery goals and autonomous
reasons accounted for variance in deep learning, help-seeking, and
challenging tasks when tested separately, and also explained
independent variance in deep learning and challenging tasks when
tested simultaneously (with the predictive strength of each being
diminished). For help-seeking, however, predictions were not
supported. Autonomous reasons no longer explained a significant
portion of variance in help-seeking when mastery goals were
controlled for, and controlling for autonomous reasons did not
significantly diminish the influence of mastery goals. Together
with the findings of Studies 1 and 2 for satisfaction, this indicates
that autonomous reasons may be a more reliable predictor of some
variables (satisfaction) and mastery goals a more reliable predictor
of others (help-seeking). Rather than concluding that one construct
unilaterally reduces the predictive utility of the other, it seems best
to view both as important predictors that vary in strength as a
function of the outcome in question.

Moreover, consistent with Study 2’s findings, the autonomous
mastery goal complex explained additional variance in deep learning
and challenging tasks (but not help-seeking), and diminished
the predictive strength of mastery goals and autonomous reasons.
Thus, again, the autonomous mastery goal complex seems important
to consider, and it seems to capture some of the variance
explained by mastery goals per se and autonomous reasons per se.

We conducted Study 4 in the academic domain rather than the
work domain (see Van Yperen et al., 2014, on the importance of
attending to different achievement domains). Study 4 had a threefold
aim. First, we sought to test the robustness of Study 3’s findings
regarding mastery goals, autonomous reasons, and the autonomous
mastery goal complex as predictors of deep learning and challenging
tasks. Second, we sought to extend Studies 1-3’s findings by testing
our hypotheses with performance goals. In doing so, we included two
outcome variables that performance goals have been shown to
positively predict in prior research: surface learning and grade aspiration
(Elliot & McGregor, 2001; McGregor & Elliot, 2002). Third, we
sought to include an additional outcome variable relevant to mastery
goals, performance goals, and autonomous reasons, namely study
persistence (Elliot, McGregor, & Gable, 1999; Vallerand et al., 1997).
We tested all mastery and performance goal hypotheses in multiple
regression models with both goals included, thereby allowing us to
determine the influence of each goal while controlling for the
influence of the other.

Study  4.  Achievement  Goals,  Reasons,  Goal 
Complexes,  and  Self-Regulated  Learning 

Study 4 was designed to test achievement goals, SDT-derived
reasons, and achievement goal complexes as predictors of five
self-regulated learning outcomes in an academic context. Students
reported their academic mastery and performance goals, their
autonomous and controlled reasons for goal pursuit, and their autonomous
and controlled mastery and performance goal complexes. Participants
also reported their deep learning, surface learning, challenging tasks,
grade aspiration, and study persistence.

First, all hypotheses were the same for mastery goals, autonomous
reasons, and the autonomous mastery goal complex predicting
deep learning and challenging tasks. Second, the hypotheses
were extended to performance goals. Performance goals were
expected to be a positive predictor of surface learning and grade
aspiration (Hypothesis 1a), even when controlling for autonomous
reasons (Hypothesis 2a). Because autonomous reasons are neither
compatible nor incompatible with these outcomes (e.g., Donche,
Maeyer, Coertjens, Van Daal, & Van Petegem, 2013; Kusurkar,
Ten Cate, Vos, Westers, & Croiset, 2013), Hypotheses 1b, 2b, 3a,
and 3b were not formulated. However, as autonomous reasons
may be an ideal motivational foundation from which to efficiently
pursue performance goals, the autonomous performance
goal complex was expected to explain independent variance in
surface learning and grade aspiration (Hypothesis 4), and to
lead to a decrease in the predictive strength of performance
goals (Hypothesis 5a). Given the absence of Hypothesis 1b,
Hypothesis 5b was not formulated. Third, mastery goals
(Hypothesis 1a), performance goals (Hypothesis 1a), and autonomous
reasons (Hypothesis 1b) were each expected to be a
positive predictor of study persistence; accordingly, all remaining
hypotheses (Hypotheses 2-5) applied to the relations between
the focal predictor variables (mastery goals, performance
goals, autonomous reasons, and the autonomous achievement
goal complexes) and study persistence.

Method 

Participants. The target sample size was the same as in the
prior studies. The study was administered via the SONA Psychology
Research Participation System of a medium-sized U.S.
university. A total of 481 participants completed the questionnaire; 24
were excluded a priori due to missing data on the outcome
variables. The final sample consisted of 457 students from various
study fields, 103 men and 354 women, with a mean age of 20.21
(SD = 1.77), 81 of whom were freshmen, 135 sophomores, 118
juniors, and 122 seniors (1 “other”). Individuals received 0.5 extra
course credit for participating.

Procedure.  Participants  reported  their  academic  achievement 
goals,  reasons,  and  goal  complexes.  Again,  the  goal  and  reason 
variables  were  counterbalanced:  234  participants  completed  the 
reason  items  first,  223  completed  the  goal  items  first.  Then,  deep 
and  surface  learning,  challenging  tasks,  grade  aspiration,  and  study 
persistence  were  assessed. 

Measures. Table 6 presents the descriptive statistics and
correlation matrix. Participants responded using a 1 = not at all, 4 =
somewhat, 7 = completely scale, unless otherwise specified. The
items for all predictor variables are provided in the Appendix.

Mastery  and  performance  goals.  Elliot  and  Murayama’s 
(2008)  AGQ-R  was  used  to  assess  mastery  and  performance  goals. 
To  keep  the  achievement  goal  complex  variables  at  a  reasonable 
length,  we  used  only  two  items  to  assess  mastery  goals  and  two 


Table 6

Study 4: Descriptive Statistics and Correlation Matrix for the Main Variables

Variable                                  α     M      SD    Correlations with variables (1), (2), ... in order
(1) Mastery goals                         .78   5.40   1.19  –
(2) Performance goals                     .79   5.21   1.30  .36***
(3) Autonomous reasons                    .77   5.15   1.14  .62***, .30***
(4) Controlled reasons                    .70   4.32   1.17  .10*, .10* [one of three values missing]
(5) Autonomous mastery goal complex       .88   5.18   1.10  .73***, .30***, .73***, .08†
(6) Controlled mastery goal complex       .87   4.21   1.17  .13**, .39***, .09†, .85***, .14**
(7) Autonomous performance goal complex   .88   4.74   1.31  .29***, .60***, .36***, .33***, .42***, .39***
(8) Controlled performance goal complex   .90   4.22   1.27  -.01, .49***, .02, .72***, .02, [illegible], .53***
(9) Deep learning strategy                .82   4.61   .91   .48***, .22***, .56***, .17***, .58***, [illegible], .39***, .14**
(10) Surface learning strategy            .84   4.98   .88   .26***, .34***, [illegible], .32***, [illegible], .35***, .29***, .32***, .16***
(11) Challenging tasks                    .82   4.94   .98   .30***, .45***, .18***, .44***, [illegible], .34***, [illegible], .43***, [illegible] [one of ten values missing]
(12) Grade aspiration                     n/a   10.22  1.25  .14**, .15**, .19***, -.05, [illegible], -.06, -.01, [illegible], .00, .01 [one of eleven values missing]
(13) Persistence                          .85   5.29   1.15  .48***, .36***, .49***, .13**, .53***, .12**, .36***, .09*, .39***, .43***, .40***, [illegible]

Note. n/a means not applicable (i.e., the scale only comprises one item).
† p < .10. * p < .05. ** p < .01. *** p < .001.
items to assess performance goals (e.g., “My goal is to perform
better than the other students”).

Autonomous and controlled reasons for goal pursuit. The
same measure used in the prior study was used in this study, except
that “in my job” was replaced by “in my classes.”

Autonomous and controlled mastery and performance goal
complexes. Autonomous and controlled achievement goal complexes
were operationalized in the same way as in the prior studies
(i.e., by combining each goal statement with each reason
statement): Four items (2 goal items × 2 reason items) assessed the
autonomous mastery goal complex, eight items (2 goal items × 4
reason items) assessed the controlled mastery goal complex, four
items (2 goal items × 2 reason items) assessed the autonomous
performance goal complex, and eight items (2 goal items × 4
reason items) assessed the controlled performance goal complex.

Deep and surface learning. Kirby et al.’s (2003) Approaches
to Learning at Work Questionnaire was adapted to the academic
domain. Ten items assessed deep learning (e.g., “I spend a good
deal of my spare time learning about things related to my classes”)
and 10 items assessed surface learning (e.g., “The best way for me
to understand what technical terms mean is to remember the textbook
definitions”).

Challenging tasks. Preenen et al.’s (2011) six-item Challenging
Assignments Scale was adapted to the academic domain to
assess challenging tasks (e.g., “[In my classes I perform tasks] that
are challenging”).

Grade aspiration. McGregor and Elliot’s (2002) single-item
measure was used to assess grade aspiration. Participants were
asked to indicate “the minimum average grade that [they] would be
satisfied with in [their] classes this semester” using a 12-point scale
ranging from A to F (coded A = 12, A- = 11, B+ = 10 . . . ,
F = 1).

Study persistence. Elliot et al.’s (1999) four-item persistence
subscale was used to assess study persistence (e.g., “When
something that I am studying gets difficult, I spend extra time and effort
trying to understand it”).


Results 

Overview. We used the same analytical strategy used in Studies
2 and 3, except that performance goals were included in the goal
models. For each outcome variable, four models were built: the
“goal-only” model (including mastery and performance goals;
Model 1 in Tables 7 and 8), the “reason-only” model (including
autonomous and controlled reasons; Model 2 in Tables 7 and 8),
the “goal-and-reason” model (including mastery and performance
goals and autonomous and controlled reasons; Model 3 in Tables
7 and 8), and the “goal complex” model (including achievement
goals, reasons, and autonomous and controlled mastery and
performance goal complexes; Model 4 in Tables 7 and 8).
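
The four predictor sets described above can be written out explicitly. The sketch below is illustrative only: the variable names and the `formula` helper are hypothetical, and the text does not say which software or formula syntax was used to fit the models.

```python
# Hypothetical variable names; only the grouping of predictors comes from the text.
GOALS = ["mastery_goals", "performance_goals"]
REASONS = ["autonomous_reasons", "controlled_reasons"]
COMPLEXES = [
    "autonomous_mastery_complex",
    "controlled_mastery_complex",
    "autonomous_performance_complex",
    "controlled_performance_complex",
]

MODELS = {
    "goal-only": GOALS,                            # Model 1
    "reason-only": REASONS,                        # Model 2
    "goal-and-reason": GOALS + REASONS,            # Model 3
    "goal complex": GOALS + REASONS + COMPLEXES,   # Model 4
}


def formula(outcome, predictors):
    """Build an R-style formula string such as 'y ~ x1 + x2'."""
    return f"{outcome} ~ " + " + ".join(predictors)


for name, predictors in MODELS.items():
    print(f"{name}: {formula('deep_learning', predictors)}")
```

Models 1 and 2 are each nested within Model 3, which is in turn nested within Model 4; this nesting is what allows the change in a coefficient from one model to the next to be examined.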

Preliminary analysis. As in Studies 1-3, we conducted a
preliminary analysis to examine potential covariates (sex, age, year
at school) and order effects. None of the covariates attained
significance (ps > .111), except for a negative association between
sex and deep learning, B = -0.33 [-0.49, -0.17], p < .001, and
between age and challenging tasks, B = -0.06 [-0.12, 0], p =
.049. Although no order main effects were observed (ps > .116),
order interacted with performance goals in predicting persistence,
B = -0.17 [-0.33, -0.01], p = .042. Again, because including these
terms was not theoretically relevant and did not change the pattern of
results, they were not considered further.

Main  Analyses 

Deep  learning  and  challenging  tasks.  Table  7  presents  the 
full  set  of  results. 

“Goal-only” model. In line with Hypothesis 1a, mastery
goals were a positive predictor of deep learning, B = 0.35 [0.28,
0.42],  p  <  .001,  and  challenging  tasks,  B  =  0.25  [0.18,  0.33],  p  < 
.001. 

“Reason-only” model. In line with Hypothesis 1b, autonomous
reasons were a positive predictor of deep learning, B = 0.44
[0.38,  0.50],  p  <  .001,  and  challenging  tasks,  B  =  0.38  [0.30, 
0.45],  p  <  .001. 


Table 7

Study 4 (Deep Learning and Challenging Tasks): Coefficient Estimates and Effect Sizes for the Models Testing the Influence of
Achievement Goals Alone (Model 1; “Goal-Only” Model), Autonomous and Controlled Reasons Alone (Model 2; “Reason-Only”
Model), Achievement Goals and Reasons (Model 3; “Goal-and-Reason” Model), and Achievement Goals, Reasons, and Goal
Complexes (Model 4; “Goal Complex” Model)

Deep learning strategies

Predictor                 Model 1         Model 2         Model 3         Model 4
                          B       ηp²     B       ηp²     B       ηp²     B       ηp²
Intercept                 2.52    -       1.97    -       1.69    -       1.47    -
Mastery goals (MAp)       .35***  .19 >                   .17***  .04 >   .08†    .01
Performance goals (PAp)   .04     -                       -.02    -       -.09’   .01
Autonomous reasons                        .44***  .31 >   .34***  .14 >   .22***  .05
Controlled reasons                        .09**   .02     .09**   .02     .00     -
Autonomous MAp complex                                                    .20***  .03
Controlled MAp complex                                                    .09     -
Autonomous PAp complex                                                    .13**“  .03
Controlled PAp complex                                                    .00     -

Challenging tasks

Predictor                 Model 1         Model 2         Model 3         Model 4
                          B       ηp²     B       ηp²     B       ηp²     B       ηp²
Intercept                 2.85    -       2.51    -       2.15    -       1.95    -
Mastery goals (MAp)       .25***  .09 >                   .10*    .01 >   .05     -
Performance goals (PAp)   .14***  .03                     .09*    .01     .04     -
Autonomous reasons                        .38***  .19 >   .29***  .08 >   .21***  .03
Controlled reasons                        .12***  .02     .08*    .01     -.01    -
Autonomous MAp complex                                                    .15*    .01
Controlled MAp complex                                                    .04     -
Autonomous PAp complex                                                    .05     -
Controlled PAp complex                                                    .07     -

Note. Variables are not centered. “>” means that the predictive strength of mastery goals in Model 1 is significantly greater than the predictive strength
of mastery goals in Model 3 (i.e., there is a significant reduction from Model 1 to Model 3). This is the case for the other model comparisons (i.e., Model
2 vs. 3, and Model 3 vs. 4) and variable (i.e., autonomous reasons) as well.

† p < .10. * p < .05. ** p < .01. *** p < .001.


Table 8

Study 4 (Surface Learning, Grade Aspiration, and Study Persistence): Coefficient Estimates and Effect Sizes for the Models Testing the Influence of Achievement Goals Alone
(Model 1; “Goal-Only” Model), Autonomous and Controlled Reasons Alone (Model 2; “Reason-Only” Model), Achievement Goals and Reasons (Model 3; “Goal-and-Reason”
Model), and Achievement Goals, Reasons, and Goal Complexes (Model 4; “Goal Complex” Model)

“Goal-and-reason” model. In line with Hypothesis 2a, mastery
goals remained a positive predictor of deep learning, B = 0.17
[0.09, 0.24], p < .001, and challenging tasks, B = 0.10 [0.01, 0.18],
p = .031. In line with Hypothesis 2b, autonomous reasons
remained a positive predictor of deep learning, B = 0.34 [0.26,
0.41], p < .001, and challenging tasks, B = 0.29 [0.20, 0.37], p <
.001.

In line with Hypothesis 3a, the Monte Carlo method revealed
that the reductions of the relations between mastery goals and both
deep learning, B = 0.19 [0.14, 0.24] (53% reduction), and
challenging tasks, B = 0.16 [0.11, 0.22] (63%), due to the inclusion of
autonomous reasons were significant (Zs ≥ 5.85, ps < .001). In
line with Hypothesis 3b, the reductions of the relations between
autonomous reasons and both deep learning, B = 0.10 [0.05, 0.14]
(22%), and challenging tasks, B = 0.06 [0.01, 0.11] (16%), due to
the inclusion of mastery goals were significant (Zs ≥ 2.15, ps ≤
.032).
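The percentage reductions can be approximated from the coefficients reported in the text. The sketch below uses a hypothetical helper, `percent_reduction`, applied to the rounded coefficients; small departures from the published percentages are presumably due to that rounding.

```python
def percent_reduction(b_before: float, b_after: float) -> float:
    """Percentage drop in a coefficient once the other predictor is added."""
    if b_before == 0:
        raise ValueError("b_before must be nonzero")
    return 100.0 * (b_before - b_after) / b_before


# Mastery goals predicting deep learning: Model 1 (0.35) vs. Model 3 (0.17).
mastery_deep = percent_reduction(0.35, 0.17)

# Autonomous reasons predicting deep learning: Model 2 (0.44) vs. Model 3 (0.34).
reasons_deep = percent_reduction(0.44, 0.34)
```

With these rounded inputs the values come out near 51% and 23%, close to the reported 53% and 22%.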

“Goal  complex”  model.  In  line  with  Hypothesis  4,  the  auton¬ 
omous  mastery  goal  complex  was  a  positive  predictor  of  deep 
learning,  B  =  0.20  [0.10,  0.31],  p  <  .001,  and  challenging  tasks, 
B  =  0.15  [0.02,  0.28],  p  =  .023. 

In line with Hypothesis 5a, the Monte Carlo method revealed
that the reduction of the relations between mastery goals and both
deep learning, B = 0.08 [0.04, 0.13] (49%), and challenging tasks,
B = 0.06 [0.01, 0.11] (56%), due to the inclusion of the autonomous
mastery goal complex were significant (Zs ≥ 2.24, ps ≤
.025; mastery goals respectively became a marginal, p = .057, and
a nonsignificant, p = .374, predictor). In line with Hypothesis 5b,
the  reduction  of  the  relations  between  autonomous  reasons  and 
both  deep  learning,  B  =  0.08  [0.04,  0.13]  (27%),  and  challenging 
tasks,  B  =  0.06  [0.01,  0.11]  (22%),  due  to  the  inclusion  of  the 
autonomous  mastery  goal  complex  were  significant  (Zs  >  2.24, 
ps  <  .025;  autonomous  reasons  remained  a  significant  predictor  in 
both  instances,  ps  <  .001). 

Surface  learning  and  grade  aspiration.  Table  8  presents  the 
full  set  of  results. 

“Goal-only” model. In line with Hypothesis 1a, performance
goals were a positive predictor of surface learning, B = 0.19 [0.13,
0.25], p < .001, and grade aspiration, B = 0.12 [0.02, 0.21], p =
.018.⁵

“Goal-and-reason”  model.  In  line  with  Hypothesis  2a,  per¬ 
formance  goals  remained  a  positive  predictor  of  surface  learning, 
B  =  0.12  [0.06,  0.19],  p  <  .001,  and  grade  aspiration,  B  =  0.15 
[0.05, 0.26], p = .004. Hypotheses 2b, 3a, and 3b were not
formulated.

“Goal  complex”  model.  Contrary  to  Hypothesis  4,  the  auton¬ 
omous  performance  goal  complex  was  not  a  positive  predictor  of 
surface  learning,  B  =  0.02  [-0.07,  0.10],  p  =  .708;  in  line  with 
Hypothesis  4,  the  autonomous  performance  goal  complex  was  a 
positive predictor of grade aspiration, B = 0.13 [0, 0.27], p = .047.

Hypothesis 5a was not tested for surface learning, given the null
result for the autonomous performance goal complex. In line with
Hypothesis 5a, the Monte Carlo method revealed that the 36%
reduction of the relation between performance goals and grade
aspiration due to the inclusion of the autonomous performance
goal complex was significant, B = 0.05 [0, 0.10] (although Z =
1.94, p = .051; performance goals became a nonsignificant
predictor, p = .158). Hypothesis 5b was not formulated.

⁵ Thirty-eight participants did not provide an answer to the single-item
grade aspiration scale; they were treated as missing values for this outcome
variable.

Persistence.  Table  8  presents  the  full  set  of  results. 

“Goal-only” model. In line with Hypothesis 1a, both mastery
goals and performance goals were positive predictors of study
persistence, B = 0.39 [0.31, 0.47], p < .001, and B = 0.19 [0.11,
0.26], p < .001, respectively.

“Reason-only” model. In line with Hypothesis 1b, autonomous
reasons were a positive predictor of study persistence, B =
0.48 [0.40, 0.57], p < .001.

“Goal-and-reason” model. In line with Hypothesis 2a, both
mastery goals, B = 0.23 [0.13, 0.32], p < .001, and performance
goals, B = 0.16 [0.08, 0.24], p < .001, remained positive
predictors of study persistence. In line with Hypothesis 2b,
autonomous reasons remained a positive predictor of study persistence,
B = 0.29 [0.19, 0.39], p < .001.

In line with Hypothesis 3a, the Monte Carlo method revealed
that the 42% reduction of the relation between mastery goals and
study persistence due to the inclusion of autonomous reasons was
significant, B = 0.16 [0.11, 0.22] (Z = 5.42, p < .001); the
corresponding 11% reduction of the relation between performance
goals and study persistence was marginal, B = 0.02 [0, 0.04] (Z =
1.77, p = .077). In line with Hypothesis 3b, the 31% reduction of
the relation between autonomous reasons and study persistence
due to the inclusion of mastery goals was significant, B = 0.13
[0.07, 0.19] (Z = 4.39, p < .001); the corresponding 6% reduction
of the relation between autonomous reasons and study persistence
due to the inclusion of performance goals was marginal, B = 0.02
[0, 0.04] (Z = 1.69, p = .092).
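This section does not spell out how the Monte Carlo intervals were obtained. One common form of the method, assumed here, treats the reduction as a product of two regression paths, draws each path from a normal distribution centered on its estimate, and takes percentiles of the simulated products. The function name `monte_carlo_ci` and the inputs in the usage note are hypothetical and are not values from Study 4.

```python
import random


def monte_carlo_ci(a, se_a, b, se_b, draws=20000, alpha=0.05, seed=0):
    """Percentile interval for the product a*b (a sketch, not the authors' code).

    a, b       -- estimated path coefficients (hypothetical inputs)
    se_a, se_b -- their standard errors
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    products = sorted(
        rng.gauss(a, se_a) * rng.gauss(b, se_b) for _ in range(draws)
    )
    lower = products[int((alpha / 2) * draws)]
    upper = products[int((1 - alpha / 2) * draws) - 1]
    return lower, upper
```

For instance, `monte_carlo_ci(0.5, 0.05, 0.3, 0.05)` yields an interval around the product 0.15; an interval that excludes zero corresponds to a significant reduction in this framing.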

“Goal complex” model. In line with Hypothesis 4, the autonomous
mastery goal complex was a positive predictor of study
persistence, B = 0.25 [0.11, 0.40], p < .001, and the autonomous
performance goal complex was a marginally significant positive
predictor, B = 0.08 [-0.01, 0.18], p = .092.

In  line  with  Hypothesis  5a,  the  Monte  Carlo  method  revealed 
that  the  45%  reduction  of  the  relation  between  mastery  goals  and 
study  persistence  due  to  the  inclusion  of  the  autonomous  mastery 
goal  complex  was  significant,  B  =  0.10  [0.04,  0.16]  (Z  =  3.36, 
p  <  .001;  mastery  goals  remained  a  positive  predictor,  p  =  .035). 
The  18%  reduction  of  the  relation  between  performance  goals  and 
study  persistence  due  to  the  inclusion  of  the  autonomous  perfor¬ 
mance  goal  complex  was  marginal,  B  =  0.03  [0,  0.07]  (Z  =  1.66, 
p  =  .098).  In  line  with  Hypothesis  5b,  the  39%  reduction  of  the 
relation  between  autonomous  reasons  and  study  persistence  due  to 
the  inclusion  of  the  autonomous  mastery  goal  complex  was  sig¬ 
nificant,  B  =  0.10  [0.04,  0.16]  (Z  =  3.36,  p  <  .001;  autonomous 
reasons remained a positive predictor, p = .009); the corresponding
4% reduction due to the inclusion of the autonomous performance
goal complex was nonsignificant, B = 0.10 [0, 0.23] (Z =
1.13, p = .260).

Discussion 

Replicating  Study  3’s  findings,  mastery  goals  and  autonomous 
reasons  accounted  for  variance  in  deep  learning  and  challenging 
tasks  when  tested  separately  or  simultaneously  (with  the  predictive 
strength of each being diminished). Moreover, the autonomous
mastery goal complex explained additional variance in deep learning
and challenging tasks, and diminished the predictive strength
of both mastery goals and autonomous reasons.

Extending  Study  3’s  findings,  performance  goals  accounted  for 
variance  in  surface  learning  and  grade  aspiration,  when  testing 
goals  and  reasons  separately  or  simultaneously.  Moreover,  the 
autonomous  performance  goal  complex  explained  additional  vari¬ 
ance  in  grade  aspiration,  and  diminished  the  predictive  strength  of 
performance  goals.  The  autonomous  performance  goal  complex 
did  not  explain  additional  variance  in  surface  learning. 

Further  extending  Study  3’s  findings,  mastery  goals,  perfor¬ 
mance  goals,  and  autonomous  reasons  accounted  for  variance  in 
study  persistence  when  testing  goals  and  reasons  separately  or 
simultaneously  (with  the  predictive  strength  of  each  being  dimin¬ 
ished).  Moreover,  the  autonomous  mastery  and  performance  goal 
complexes  explained  additional  variance  in  persistence,  and  di¬ 
minished  the  predictive  strength  of  mastery  goals,  performance 
goals,  and  autonomous  reasons.  The  reductions  of  the  influence  of 
performance  goals  and  the  influence  of  the  autonomous  perfor¬ 
mance  goal  complex  only  attained  marginal  significance. 

General  Discussion 

Although  research  on  achievement  goals  and  reasons  has  only 
recently  commenced,  there  has  been  a  growing  interest  in  studying 
the  SDT-derived  reasons  connected  to  achievement  goals  (see 
Vansteenkiste,  Lens,  et  al.,  2014).  The  findings  from  this  work 
have  often  been  interpreted  as  indicating  that  the  influence  of 
achievement  goals  on  beneficial  outcomes  is  reducible  to  the 
influence  of  reasons.  In  the  present  research,  we  developed  a 
systematic  approach  to  studying  goals,  reasons,  and  goal  com¬ 
plexes,  and  utilized  this  approach  to  clearly  differentiate  between 
the  influence  of  achievement  goals,  autonomous  and  controlled 
reasons,  and  achievement  goal  complexes.  Our  results  revealed 
that  all  three  types  of  variables  accounted  for  independent  variance 
in  experiential  and  self-regulated  learning  outcomes. 

Summary  of  Findings 

First,  we  documented  the  separate  influence  of  mastery  goals 
and  autonomous  reasons  for  goal  pursuit.  On  the  one  hand,  mas¬ 
tery  goals  were  found  to  be  a  positive  predictor  of  beneficial 
experiential  (satisfaction,  interest,  and  positive  emotion)  and  self- 
regulated  learning  (deep  learning,  interpersonal  help-seeking,  chal¬ 
lenging  tasks,  and  persistence)  outcomes.  This  replicates  basic 
findings  from  the  achievement  goal  literature,  showing  that  mas¬ 
tery  goals  enhance  the  subjective  value  of  the  achievement  activity 
and  foster  interest-based  learning  processes  (Daniels  et  al.,  2009). 
On  the  other  hand,  autonomous  reasons  were  found  to  be  a  positive 
predictor  of  the  same  beneficial  outcomes.  This  replicates  basic 
findings  from  the  SDT  literature,  showing  that  reasons  involving 
the  self-endorsement  of  one’s  actions  enhance  task  enjoyment  and 
facilitate  growth  (Deci  et  al.,  1991). 

Second, we documented the simultaneous influence of mastery
goals and autonomous reasons for goal pursuit. On the one hand,
both mastery goals and autonomous reasons were found to explain
independent variance in most of the beneficial experiential
(interest and positive emotion) and self-regulated learning (deep
learning, challenging tasks, and persistence) outcomes. This illustrates
that mastery goals and autonomous reasons are distinct motivational
constructs, presumably having similar influences via different
processes (Dysvik & Kuvaas, 2010). On the other hand, the
predictive strengths of mastery goals and autonomous reasons for
these same outcomes were each found to be diminished when
controlling for the other. This illustrates that mastery goals and
autonomous reasons are overlapping motivational constructs, both
pertaining to an internal investment in the value of learning (Elliot
& Church, 1997). However, controlling for mastery goals
eliminated the link between autonomous reasons and interpersonal
help-seeking, whereas controlling for autonomous reasons
eliminated the link between mastery goals and satisfaction. This
suggests that the influence of reasons may outweigh the influence of
goals for some outcomes, but that the influence of goals may
outweigh the influence of reasons for other outcomes.

Third,  we  documented  the  influence  of  the  autonomous  mastery 
goal  complex  together  with  mastery  goals  and  autonomous  reasons 
for  goal  pursuit.  On  the  one  hand,  the  autonomous  mastery  goal 
complex  was  found  to  explain  incremental  variance  in  all  of  the 
beneficial  experiential  outcomes  (interest,  satisfaction,  and  posi¬ 
tive  emotion)  and  most  of  the  beneficial  self-regulated  learning 
outcomes  (i.e.,  deep  learning,  challenging  tasks,  and  persistence). 
This  indicates  that  the  autonomous  mastery  goal  complex  is  more 
than  the  mere  sum  of  a  mastery  goal  and  autonomous  reasons: 
Autonomous  reasons  may  give  deeper  psychological  meaning  to 
the mastery goal, and the mastery goal may then foster a pleasurable,
interest-driven approach to learning (Ryan & Deci, 2006). On
the other hand, the predictive strengths of mastery goals and
autonomous reasons regarding these same outcomes were each found
to be diminished when controlling for the autonomous mastery
goal complex. This is likely due to measurement redundancy:
Mastery  goals  and  autonomous  reasons  were  each  measured  (at 
least)  two  times,  first  as  a  “pure”  goal  or  a  “pure”  reason,  and 
second  as  a  part  of  the  autonomous  mastery  goal  complex.  How¬ 
ever,  for  many  outcomes,  mastery  goals  and  autonomous  reasons 
still  explained  residual  variance  after  controlling  for  the  autono¬ 
mous  mastery  goal  complex.  Hence,  it  appears  that  mastery  goals 
in  and  of  themselves  (or,  perhaps  more  accurately,  mastery  goals 
energized  by  reasons  not  captured  by  the  goal  complexes  exam¬ 
ined  herein)  and  autonomous  reasons  in  and  of  themselves  (or, 
perhaps  more  accurately,  autonomous  reasons  directed  by  aims  not 
captured  by  the  goal  complexes  examined  herein)  each  have  re¬ 
maining,  substantive  predictive  utility. 

Fourth,  we  also  documented  the  influence  of  performance  goals 
and  performance  goal  complexes.  Performance  goals  were  found 
to  be  a  positive  predictor  of  surface  learning,  grade  aspiration,  and 
study  persistence,  even  after  controlling  for  reasons  for  goal  pur¬ 
suit.  Moreover,  the  autonomous  performance  goal  complex  ex¬ 
plained  incremental  variance  in  grade  aspiration  and  study  persis¬ 
tence,  resulting  in  the  diminution  of  the  predictive  strength  of  both 
performance  goals  (for  grade  aspiration)  and  autonomous  reasons 
(for  persistence).  In  the  same  way  as  for  mastery  goals,  these 
results  show  that  performance  goal  content  matters,  and  does  so  in 
two  ways:  The  influence  of  performance  goals  is  not  reducible  to 
the  influence  of  reasons,  and  the  pattern  of  results  associated  with 
the  autonomous  performance  goal  complex  differs  from  that  asso¬ 
ciated  with  the  autonomous  mastery  goal  complex. 

Fifth, in ancillary analyses we observed the influence of
controlled achievement goal complexes. In nearly all instances,
controlled achievement goal complexes did not explain incremental
variance in the beneficial experiential and self-regulated learning
outcomes (the lone exception, of 22 instances, being controlled
mastery goal complexes and deep learning in Study 2). Mastery
and performance goals do not seem to provide supplementary
benefits when combined with controlled reasons, which is consistent
with research showing that endorsing these goals for
self-presentation purposes (a form of controlled reason) lessens or
eliminates their positive influence (Dompnier, Darnon, & Butera,
2013; Smeding et al., 2015).

Both  Goals  and  Reasons  Are  Needed  for  a  Full 
Account  of  Motivation 

The  present  research  echoes  a  past  controversy  in  the  motivation 
literature.  SDT  researchers  have  long  distinguished  between  in¬ 
trinsic  (e.g.,  growth,  relationships,  community)  and  extrinsic  (e.g., 
wealth,  fame,  image)  goal  content  (for  a  review,  see  Vansteenkiste, 
Lens,  &  Deci,  2006).  Intrinsic  goals  tend  to  predict  beneficial 
outcomes,  whereas  extrinsic  goals  tend  to  predict  detrimental 
outcomes  (Kasser  &  Ryan,  1996).  In  the  late  1990s,  the  relation 
between  intrinsic  goals  and  a  self-regulation  outcome  (self- 
actualization)  was  found  to  be  eliminated  when  partialing  out  the 
influence  of  the  autonomous  and  controlled  reasons  connected  to 
these  goals  (Carver  &  Baird,  1998).  The  authors  interpreted  this 
finding  as  suggesting  that  “it  often  matters  more  why  a  goal  is 
being  pursued  than  what  the  goal  is”  (p.  292).  Later,  the  relation 
between  extrinsic  goals  and  an  experiential  outcome  (well-being) 
was  also  found  to  be  eliminated  when  controlling  for  the 
autonomous-like  (i.e.,  freedom  of  action  motives)  and  controlled- 
like  (i.e.,  appearing  worthy  in  others’  eyes)  reasons  connected  to 
these  goals  (Srivastava,  Locke,  &  Bartol,  2001).  Here  too  the 
conclusion  was  reached  that  the  predictive  utility  of  goals  is 
negligible  once  reasons  are  considered. 

However,  Sheldon,  Ryan,  Deci,  and  Kasser  (2004)  critiqued  the 
aforementioned  research,  highlighting  that  goal  assessment  was 
confounded  with  reason  assessment.  After  refining  the  methodol¬ 
ogy  of  the  prior  work,  Sheldon  et  al.  (2004)  demonstrated  that  both 
goal  content  (i.e.,  intrinsic  vs.  extrinsic  goals)  and  goal  motives 
(i.e.,  autonomous  vs.  controlled  reasons)  made  significant  and 
independent  contributions  to  psychological  well-being.  They  came 
to  the  conclusion  that  neither  the  directive  focus  of  goals  nor  the 
dynamic  processes  underlying  goals  was  more  critical  than  the 
other  (for  similar  work  showing  that  both  goal  content  and  reasons 
are  important  to  understand  outcomes  in  the  exercise  domain,  see 
Sebire,  Standage,  &  Vansteenkiste,  2009). 

Similar reasoning applies to the emerging research on goal
complexes within the achievement domain. In prior work, the
relation between achievement goals and a series of
achievement-relevant outcomes (e.g., positive emotion, engagement,
persistence) was found to be eliminated when partialing out the
influence of the autonomous reasons connected to these goals (see
Gillet et al., 2015; Vansteenkiste, Mouratidis, et al., 2010;
Vansteenkiste, Smeets, et al., 2010). Because this prior work did not
include “pure reason” assessments, we believe that this type of
reduction should be interpreted with caution. Indeed, our findings
indicate that the influence of achievement goal content is not
reducible to the influence of achievement goal motives. The
influence of achievement goals is not unilaterally exceeded by the
influence of reasons, and the influence of achievement goal
complexes depends on both the type of goal and the type of reason they
encompass. As such, it is best for scholars to resist “either-or”
perspectives on achievement motivation: Not only do reasons for
goal pursuit matter, but the goals themselves matter as well. Thus,
we concur with Vansteenkiste, Mouratidis, et al.’s (2014)
statement that “reasons [should] not [be] meant to replace the
achievement goals themselves” (p. 142).

Short-Term  and  Long-Term  Research  Directions 

We  believe  that  a  clear  conceptual  and  empirical  disentangle¬ 
ment  of  achievement  goals  and  reasons  brings  a  fresh,  exciting, 
and  generative  perspective  to  the  achievement  goal  literature.  In 
the  short  term,  researchers  may  consider  adopting  a  cumulative 
approach  that  involves  further  investigating  the  influence  of 
achievement  goals,  reasons,  and  achievement  goal  complexes  on 
achievement-relevant outcomes. Specifically, researchers may
focus on other achievement goals (e.g., avoidance-based goals; see
Gillet et al., 2015), non-SDT-derived reasons (e.g., achievement
motives, Elliot, 1999; social motivation, Ryan & Shim, 2008;
competitive motives, Murayama & Elliot, 2012), unusual goal
complexes (e.g., formed upon the adoption of maladaptive goals
and adaptive reasons, such as the autonomous
performance-avoidance complex; see Heidemeier & Wiese, 2014), and/or a
wider range of outcomes (e.g., beneficial and detrimental; see
Senko, 2016).

In the long term, researchers may consider adopting a more
comprehensive approach that involves moving beyond comparison
of the influence of achievement goals, reasons, and achievement
goal complexes. Conceptualizing and operationalizing
achievement goal complexes raise two important, intertwined issues that
need to be addressed in future work: complexity and ecological
validity. Regarding complexity, the most elaborate achievement
goal framework encompasses 3 × 2 achievement goals (i.e., task-,
self-, and other-based standards crossed with approach and
avoidance; Elliot, Murayama, & Pekrun, 2011), and the
self-determination framework encompasses five main types of reasons
(i.e., extrinsic reasons with external, introjected, identified, or
integrated regulation, and intrinsic reasons; Ryan & Deci, 2000).
Fully integrating these frameworks would result in 3 × 2 × 5 =
30 possible achievement goal complexes, which are clearly too
many to rigorously study at the same time. As such, it is important
for researchers to select a subset of achievement goals and reasons
in any given investigation to avoid overtaxing participants with a
large number of related and (seemingly) redundant questions
(which would undoubtedly yield poor quality data).
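The count of 30 follows from fully crossing the two frameworks. A brief sketch that enumerates the combinations is given below; the category labels are taken from the text, whereas the list names are hypothetical.

```python
from itertools import product

# Labels follow the frameworks named in the text; list names are illustrative.
STANDARDS = ["task", "self", "other"]      # 3 competence standards
VALENCES = ["approach", "avoidance"]       # 2 valences
REASONS = ["external", "introjected", "identified", "integrated", "intrinsic"]

# Every (standard, valence, reason) triple is one possible goal complex.
goal_complexes = list(product(STANDARDS, VALENCES, REASONS))
```

The resulting list has 3 × 2 × 5 = 30 entries, one of which is, for example, `("task", "approach", "intrinsic")`.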

Regarding  ecological  validity,  researchers  may  consider  which 
achievement  goal  complexes  are  more  commonly  encountered  in 
real-life  achievement  settings.  It  is  known  that  mastery-approach, 
performance-approach,  and  performance-avoidance  are  spontane¬ 
ously  generated  by  participants  (in  their  own  words)  in  open-ended 
questions  or  semistructured  interviews  (Lee  &  Bong,  2016;  Levy, 
Kaplan,  &  Patrick,  2004;  Urdan,  2004b).  However,  little  is  known 
about  the  spontaneously  generated  reasons  behind  mastery- 
approach,  performance-approach,  and  performance-avoidance 
goals (for an exception, see Urdan & Mestas, 2006). Future
research would benefit from using inductive methods to determine
the most prevalent achievement goal-reason combinations
(and whether SDT or some other approach or approaches to
motivation is/are best suited to conceptualize these achievement
goal complexes) and using deductive methods to estimate their
consequences for achievement-relevant outcomes. Such a
mixed  method  research  program  (see  Johnson  &  Onwuegbuzie, 
2004)  would  help  motivation  scientists  to  focus  their  conceptual 
attention  and  empirical  effort  on  variables  of  foremost  practical 
significance. 

Limitations 

The  limitations  of  our  work  should  be  acknowledged.  First,  the 
present  studies  were  correlational  and  relied  on  single-session  data 
collections.  Hence,  we  cannot  establish  the  causal  nature  of  the 
motivation-to-outcome  relations.  Subsequent  research  using  pro¬ 
spective  methods  is  needed  to  acquire  more  precise  insight  into 
these  dynamics.  For  instance,  motivational  and  outcome  variables 
could  be  assessed  at  different  times  (as  in  Harackiewicz  et  al., 
1997)  or  a  longitudinal  design  could  be  employed  (as  in  Daniels  et 
al.,  2009). 

Second, mastery goals and autonomous reasons were moderately
to highly correlated (r ≈ .60), as in past research (e.g., Katz
et al., 2008). That is, the two motivational constructs are
multicollinear, suggesting that mastery goals are primarily pursued for
autonomous reasons (see Senko & Tropiano, 2016). However, it
should be noted that multicollinearity is not a violation of the
assumptions of ordinary least squares estimation (Freund & Littell,
2000). Multiple regression analysis has enabled us to estimate the
unique variance explained by mastery goals, after removing the
shared variance associated with autonomous reasons (and vice
versa). The only risk with multicollinearity stems from a lack of
information in the data (e.g., participants with high mastery goals
and low autonomous reasons are unusual; see Brambor, Clark, &
Golder, 2006). In this regard, multicollinearity may have increased
the probability of Type II error (false negative) but not that of Type
I error (false positive; see Mason & Perreault Jr, 1991).
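The size of this risk can be gauged with the variance inflation factor, which for two predictors correlated at r equals 1 / (1 - r²). The sketch below is not an analysis from the article; it considers only the two predictors named above, and the function name is hypothetical.

```python
import math


def variance_inflation_factor(r: float) -> float:
    """VIF for a model with two predictors whose correlation is r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)


vif = variance_inflation_factor(0.60)   # sampling variance multiplier
se_inflation = math.sqrt(vif)           # standard error multiplier
```

At r = .60 the sampling variance of each coefficient is multiplied by about 1.56 and its standard error by about 1.25, which lowers power (Type II error) without biasing the estimates.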

Third,  the  assessment  of  our  main  theoretical  constructs,  namely 
mastery  goals,  autonomous  reasons,  and  beneficial  outcomes,  may 
be  subject  to  social  desirability  (see  Darnon,  Dompnier,  Delmas, 
Pulfrey,  &  Butera,  2009;  Lepper,  Corpus,  &  Iyengar,  2005).  Thus, 
the  link  between  these  constructs  might  be  partially  explained  by 
covarying  interindividual  differences  in  self-presentation.  How¬ 
ever,  it  is  important  to  note  that  such  impression-management 
issues  cannot  account  for  the  robust  finding  that  both  achievement 
goals  and  reasons  have  independent  predictive  utility.  Neverthe¬ 
less,  subsequent  research  would  benefit  from  controlling  for  social 
desirability  and  incorporating  behavioral  measures  falling  outside 
the  categories  of  the  variables  studied  in  the  present  article  (e.g., 
achievement,  see  Senko,  Hulleman,  &  Harackiewicz,  2011). 

Fourth, our studies were based on U.S. samples. The levels of
both achievement goals and self-determined motivation have been
found to vary somewhat across cultures (Chirkov & Ryan, 2001;
Dekker & Fischer, 2008), as have predictive patterns for
achievement goals (Zan, Xiang, Louis, Jianmin, & YunPeng, 2008; see
Chirkov, 2009 on autonomous motivation, which may have more
universal predictive power). Given these cross-cultural
differences, research is needed to test the predictive utility of
achievement goals, reasons, and achievement goal complexes in a broader
array of countries.

Conclusion

The  achievement  goals  approach  to  achievement  motivation 
identifies  a  number  of  possible  goal  contents  in  competence¬ 
relevant  contexts  that  vary  according  to  how  competence  is  de¬ 
fined  and  valenced  (Elliot  et  al.,  2011),  whereas  SDT  designates  a 
continuum  of  possible  goal  motives  ranging  from  autonomous  to 
controlled  (Deci  &  Ryan,  2000).  Our  research  herein  suggests  that 
these  two  frameworks  should  be  thought  of  in  integrative  rather 
than  comparative  terms:  Achievement  goals,  reasons  for  goal 
pursuit,  and  achievement  goal  complexes  all  make  independent 
contributions  to  experiential  and  self-regulated  learning  outcomes 
in  achievement  settings.  In  our  view,  conceptualizing,  operation¬ 
alizing,  and  empirically  analyzing  both  the  direction  and  energi¬ 
zation  of  goal  striving  using  both  of  these  theoretical  frameworks 
offers  the  most  promising  avenue  for  a  full  and  complete  account 
of  competence  motivation. 
Appendix

Achievement  Goal  Questionnaire,  Autonomous  and  Controlled  Reasons  Scale,  and  Autonomous 
and  Controlled  Achievement  Goal  Complex  Scale  (Study  4) 


The  first  scale  contains  mastery  goal  (MAp)  and  performance 
approach  goal  (PAp)  items,  the  second  scale  contains  autonomous 
reasons  (AR)  and  controlled  reasons  (CR)  items,  and  the  third 
scale  represents  autonomous  mastery  goal  complex  (MAp  X  AR), 
controlled  mastery  goal  complex  (MAp  X  CR),  autonomous  per¬ 
formance  goal  complex  (PAp  X  AR),  and  controlled  performance 
goal  complex  (PAp  X  CR)  items. 

Below  you  will  find  statements  that  represent  descriptions  of 
how  you  might  pursue  goals  in  your  classes  at  the  university. 

Please  indicate  how  true  each  statement  is  for  you. 

My aim is to completely master the material presented in my
classes. (MAp)

My  goal  is  to  perform  better  than  the  other  students.  (PAp) 

My  goal  is  to  learn  as  much  as  possible.  (MAp) 

My  aim  is  to  perform  well  relative  to  other  students.  (PAp) 

Below  you  will  find  statements  that  represent  explanations  for 
why  you  might  pursue  goals  in  your  classes  at  the  university. 

Please  indicate  how  true  each  statement  is  for  you. 

In  my  classes,  I  pursue  goals  because  I  find  them  highly  stim¬ 
ulating  and  challenging.  (AR) 

In  my  classes,  I  pursue  goals  because  I  find  them  personally 
valuable  goals.  (AR) 

In  my  classes,  I  pursue  goals  because  I  would  feel  bad,  guilty, 
or  anxious  if  I  didn’t  do  it.  (CR) 

In  my  classes,  I  pursue  goals  because  I  can  only  be  proud  of 
myself  if  I  do  so.  (CR) 

In  my  classes,  I  pursue  goals  because  I  have  to  comply  with  the 
demands  of  others  such  as  parents,  friends,  and  teachers.  (CR) 

In  my  classes,  I  pursue  goals  because  others  will  reward  me  only 
if  I  achieve  these  goals.  (CR) 

Below  you  will  find  statements  that  represent  descriptions  of 
how  you  might  pursue  goals  in  your  classes  at  university,  to¬ 
gether  with  explanations  for  why  you  might  pursue  them.  Please 
read  each  statement  carefully,  and  indicate  how  true  each  of  it  is 
for  you. 

My  goal  is  to  learn  as  much  as  possible  because  I  find  this  a 
highly  stimulating  and  challenging  goal.  (MAp  X  AR) 

My  aim  is  to  completely  master  the  material  presented  in  my 
classes  because  I  would  feel  bad,  guilty,  or  anxious  if  I  didn’t  do 
it.  (MAp  X  CR) 

My  goal  is  to  learn  as  much  as  possible  because  I  can  only  be 
proud  of  myself  if  I  do  so.  (MAp  X  CR) 

My  aim  is  to  completely  master  the  material  presented  in  my 
classes  because  I  find  this  a  personally  valuable  goal.  (MAp  X 
AR) 

My  goal  is  to  learn  as  much  as  possible  because  I  have  to 
comply  with  the  demands  of  others  such  as  parents,  friends,  and 
teachers.  (MAp  X  CR) 


My  aim  is  to  completely  master  the  material  presented  in  my 
classes  because  others  will  reward  me  only  if  I  achieve  this  goal. 
(MAp  X  CR) 

My  aim  is  to  completely  master  the  material  presented  in  my 
classes  because  I  find  this  a  highly  stimulating  and  challenging 
goal.  (MAp  X  AR) 

My  goal  is  to  learn  as  much  as  possible  because  I  would  feel 
bad,  guilty,  or  anxious  if  I  didn’t  do  it.  (MAp  X  CR) 

My  aim  is  to  completely  master  the  material  presented  in  my 
classes  because  I  can  only  be  proud  of  myself  if  I  do  so.  (MAp  X 
CR) 

My  goal  is  to  learn  as  much  as  possible  because  I  find  this  a 
personally  valuable  goal.  (MAp  X  AR) 

My  aim  is  to  completely  master  the  material  presented  in  my 
classes  because  I  have  to  comply  with  the  demands  of  others  such 
as  parents,  friends,  and  teachers.  (MAp  X  CR) 

My  goal  is  to  learn  as  much  as  possible  because  others  will 
reward  me  only  if  I  achieve  this  goal.  (MAp  X  CR) 

My goal is to perform better than the other students because I
find this a highly stimulating and challenging goal. (PAp X AR)

My aim is to perform well relative to other students because I
would feel bad, guilty, or anxious if I didn’t do it. (PAp X CR)

My goal is to perform better than the other students because I
can only be proud of myself if I do so. (PAp X CR)

My  aim  is  to  perform  well  relative  to  other  students  because  I 
find  this  a  personally  valuable  goal.  (PAp  X  AR) 

My  goal  is  to  perform  better  than  the  other  students  because  I 
have  to  comply  with  the  demands  of  others  such  as  parents, 
friends,  and  teachers.  (PAp  X  CR) 

My aim is to perform well relative to other students because
others will reward me only if I achieve this goal. (PAp X CR)

My aim is to perform well relative to other students because I
find this a highly stimulating and challenging goal. (PAp X AR)

My goal is to perform better than the other students because I
would feel bad, guilty, or anxious if I didn’t do it. (PAp X CR)

My aim is to perform well relative to other students because I
can only be proud of myself if I do so. (PAp X CR)

My  goal  is  to  perform  better  than  the  other  students  because  I 
find  this  a  personally  valuable  goal.  (PAp  X  AR) 

My  aim  is  to  perform  well  relative  to  other  students  because  I 
have  to  comply  with  the  demands  of  others  such  as  parents, 
friends,  and  teachers.  (PAp  X  CR) 

My  goal  is  to  perform  better  than  the  other  students  because 
others  will  reward  me  only  if  I  achieve  this  goal.  (PAp  X  CR) 
Identifying Pre-High School Students’ Science Class Motivation Profiles to
Increase Their Science Identification and Persistence

One purpose of this study was to determine whether patterns existed in pre-high school students’
motivation-related  perceptions  of  their  science  classes.  Another  purpose  was  to  examine  the  extent  to 
which  these  patterns  were  related  to  their  science  identification,  gender,  grade  level,  class  effort,  and 
intentions  to  persist  in  science.  We  collected  data  from  pre-high  school  students  (Grades  5  through  7, 
52.5%  female,  and  90.7%  who  self-identified  as  White)  from  2  rural  public  schools  in  Southwest 
Virginia.  Our  analysis  included  data  from  937  questionnaires  that  measured  students’  perceptions  of 
empowerment/autonomy,  usefulness/utility  value,  expectancy  for  success,  situational  interest,  and  caring 
in  science  class.  Using  cluster  analysis,  we  identified  5  clusters  (i.e.,  “motivation  profiles”)  of  students: 
(a)  low  motivation,  (b)  low  usefulness  and  interest  but  high  success  and  caring,  (c)  somewhat  high 
motivation,  (d)  somewhat  high  motivation  and  high  success  and  caring,  and  (e)  high  motivation.  We 
tested  the  cluster  stability  by  cluster  analyzing  subsamples  by  year  of  data  collection  and  by  grade  level. 
Significant  relationships  existed  between  these  motivation  profiles  and  students’  science  identification, 
gender,  grade  level,  science  class  effort,  and  intentions  to  persist  in  science.  These  findings  may  support 
science  educators  in  targeting  students  with  similar  motivation  profiles  rather  than  adhering  to  the 
difficult  and  often  unrealistic  task  of  catering  to  each  student’s  complex  needs,  individually. 

Keywords:  cluster  analysis,  motivation,  motivation  profiles,  person-centered  research,  science  education 


The  overall  aim  of  this  study  was  to  better  understand  how 
pre-high  school  (i.e.,  grades  five  through  seven)  students’  percep¬ 
tions  of  their  science  classes  can  affect  their  science  identification 
and  intentions  to  persist  in  science-related  fields.  Persistence  in 
science  is  important  because  finding  well-educated  and  trained 
professionals  to  fill  science,  technology,  engineering,  and  mathe¬ 
matics  (STEM)  positions  is  a  national  concern  in  the  United  States 
(Smith, 2012), in part because research and funding in STEM
fields are integral to US prosperity (National Academy of Sciences
[NAS],  2007).  Providing  educational  experiences  that  support  stu¬ 
dents’  success  in  science  and  mathematics  is  critical  to  ensuring 
that  they  are  adequately  prepared  for  STEM  professions  (NAS, 
2007;  NGSS  Lead  States,  2013;  President’s  Council  of  Advisors 
on Science and Technology [PCAST], 2012; Smith, 2012). Unfortunately,
students are increasingly entering college underprepared
for and uninterested in pursuing STEM fields (Osborne,
Simon, & Collins, 2003; PCAST, 2012), which has led researchers
to  study  factors  that  can  affect  students’  motivation  and  intention 
to  persist  in  STEM  disciplines  (Renninger,  Nieswandt,  &  Hidi, 
2015). 

Researchers  have  documented  that  students’  motivation  in  sci¬ 
ence  tends  to  wane  with  age  (Osborne  et  al.,  2003;  Simpson  & 
Oliver,  1990).  For  students’  long-term  persistence  in  STEM  fields, 
it  is  especially  important  to  nurture  their  motivation  and  interest  in 
science prior to eighth grade, particularly during the pre-high
school  years  (Maltese  &  Tai,  2010;  PCAST,  2010,  2012;  Tai,  Liu, 
Maltese,  &  Fan,  2006).  The  pre-high  school  years  are  also  critical 
because  students  who  intend  to  persist  in  the  sciences  typically 
begin  their  formal  preparation  during  that  time  (NAS,  2007). 
Fortunately,  school  climate  and  science  teaching  methods  during 
the  pre-high  school  years  can  positively  impact  students’  science 
motivation  and  persistence,  which  can  help  to  prevent  the  declines 
that  have  been  found  in  less  supportive  environments  (Chittum, 
Jones,  Akalin,  &  Schram,  under  review;  Fortus  &  Vedder-Weiss, 
2014;  Vedder-Weiss  &  Fortus,  2011). 

Given  these  findings,  we  were  interested  in  how  pre-high 
school  students’  perceptions  of  their  science  classes  were  related  to 
their  science  identification  (i.e.,  the  extent  to  which  a  student 
values science as an important part of his or her “self”; Jones, Ruff,
&  Osborne,  2015),  because  when  students  identify  with  a  subject, 
they  are  more  likely  to  persist  in  the  subject  in  the  future  (Osborne 
&  Jones,  2011).  Therefore,  identifying  students’  perceptions  of 
science  class  that  affect  their  science  identification  may  lead  to 
strategies that teachers can use to foster students’ science identification
and increase their persistence in science over time. Theoretical
and empirical findings (e.g., Hidi, Renninger, &
Nieswandt, 2015; Jones, Ruff, et al., 2015; Osborne & Jones,
2011) indicate that teaching strategies that support students’ science
identification include those that are consistent with the five
components of the MUSIC® Model of Motivation (MUSIC model;
Jones, 2009, 2015): eMpowerment, Usefulness, Success, Interest,
and Caring (MUSIC is an acronym for the first sounds of these
words). Consequently, we chose to focus on students’ MUSIC
model perceptions of their science classes.

We  were  particularly  interested  in  examining  whether  patterns 
existed  in  students’  science  class  MUSIC  perceptions.  For  exam¬ 
ple,  students  may  feel  empowered  in  their  science  class  (high 
empowerment),  understand  the  usefulness  of  their  work  in  that 
science  class  (high  usefulness),  and  feel  that  they  can  be  successful 
in  that  science  class  (high  success).  Yet,  they  may  not  believe  that 
the  science  classwork  is  interesting  (low  interest)  or  that  their 
teacher  cares  about  their  learning  (low  caring).  If  several  students 
have  a  similar  pattern  of  these  five  science  class  perceptions,  it 
may  be  possible  for  teachers  to  motivate  these  students  by  tailoring 
their  instructional  strategies  to  this  pattern.  In  this  study,  we  used 
cluster  analysis,  which  is  a  person-centered  research  approach 
(Bergman,  2001),  to  determine  whether  these  types  of  patterns 
exist  for  students  in  science  classes;  and,  if  they  do,  to  identify  the 
number  and  type  of  patterns. 
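
To make the idea of a shared perception pattern concrete, the following is a minimal sketch of how a cluster analysis might group students by their five science class perceptions. It is an illustration only: the k-means procedure, the evenly spaced starting centroids, the rating values, and all function and variable names are assumptions introduced here and are not taken from the study, which may have used a different clustering method.

```python
from statistics import mean

# Order of the five perceptions in each (hypothetical) student record.
COMPONENTS = ("empowerment", "usefulness", "success", "interest", "caring")


def squared_distance(a, b):
    """Squared Euclidean distance between two rating profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given profile."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def k_means(points, k, iterations=50):
    """Basic k-means; returns (centroids, labels).

    Starting centroids are evenly spaced records, a simplification
    chosen so that the sketch is deterministic.
    """
    n = len(points)
    centroids = [tuple(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: attach each student to the closest profile.
        labels = [nearest(p, centroids) for p in points]
        # Update step: recompute each profile as the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(mean(dim) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # profiles stopped moving
        centroids = updated
    return centroids, labels


# Hypothetical ratings: three students high on all five perceptions and
# three who are high on empowerment, usefulness, and success but low on
# interest and caring (the pattern described in the text above).
students = [
    (5.5, 5.0, 5.5, 5.0, 5.5),
    (5.0, 5.5, 5.0, 5.5, 5.0),
    (5.5, 5.5, 5.0, 5.0, 5.5),
    (5.0, 5.0, 5.5, 2.0, 1.5),
    (5.5, 5.0, 5.0, 1.5, 2.0),
    (5.0, 5.5, 5.5, 2.0, 2.0),
]

centroids, labels = k_means(students, k=2)

for index, profile in enumerate(centroids):
    summary = ", ".join(f"{name}={value:.2f}"
                        for name, value in zip(COMPONENTS, profile))
    print(f"profile {index}: {summary}")
```

Each resulting centroid can be read as one candidate motivation profile, and the labels indicate which students share it. With real questionnaire data, the number of clusters would need to be chosen and checked for stability rather than fixed in advance as it is in this sketch.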

Another purpose of this study was to investigate how patterns in
students’  science  class  perceptions  relate  to  their  gender,  grade 
level,  class  effort,  and  intentions  to  persist  in  science.  Previous 
research  suggests  that  gender  can  be  an  important  factor  in  stu¬ 
dents’  motivation  and  persistence  in  STEM  fields,  with  female 
students  often  less  likely  to  be  motivated  and  persist  (Eccles,  2007; 
Maltese  &  Harsh,  2015).  Moreover,  older  students  are  often  less 
motivated  than  younger  students  (Eccles,  Wigfield,  Harold,  & 
Blumenfeld,  1993;  Jacobs,  Lanza,  Osgood,  Eccles,  &  Wigfield, 
2002).  Finally,  we  examined  students’  perceived  effort  in  class  and 
their  reported  intentions  to  persist  (i.e.,  science  course  intentions 
and  science  career  goals)  because  research  suggests  that  more 
motivated  students  often  put  forth  more  effort  and  tend  to  persist 
in related tasks or domains (Deci & Ryan, 2000; Hidi &
Renninger, 2006; Wigfield & Eccles, 2000).

Conceptual  Framework 
Science  Identification 

Students  value  some  academic  subjects  more  than  others,  and 
their  values  for  these  subjects  can  change  over  time  (Simpkins, 
Davis-Kean,  &  Eccles,  2006).  The  extent  to  which  a  student  values 
a subject as an important part of his or her “self” is defined as
domain  identification  (Jones,  Ruff,  et  al.,  2015;  Osborne  &  Jones, 
2011).  A  domain  can  refer  to  a  broader  category  (e.g.,  academics 
or  athletics)  or  a  narrower  category  (e.g.,  science  or  mathematics). 
Domain  identification  is  important  because  it  is  associated  with 
several  positive  outcomes,  such  as  classroom  participation  and 
achievement  (Voelkl,  1997),  deep  cognitive  processing  of  course 
material  and  self-regulation  (Osborne  &  Rausch,  2001),  grade 
point  average  and  academic  honors  (Osborne,  1997),  decreased 
behavioral  referrals  and  absenteeism  (Osborne  &  Rausch,  2001), 
and  career  goals  (Jones,  Osborne,  Paretti,  &  Matusovich,  2014; 
Jones,  Paretti,  Hein,  &  Knott,  2010;  Jones,  Tendhar,  &  Paretti, 
2016). 


Figure  1  shows  how  the  variables  in  the  present  study  fit  into  the 
domain  identification  model  presented  by  Osborne  and  Jones 
(2011). The left side of the figure shows the social and academic
background factors that can affect students’ science identification,
including their science class perceptions of empowerment, usefulness,
success, interest, and caring. The other parts of the figure
show that science identification affects and is affected by students’
science career goals and science course intentions. These factors
then affect students’ science class effort and science outcomes
(e.g., grades, achievement), which then cycle back and affect the
other  variables  in  the  model.  Studies  using  structural  equation 
modeling  have  confirmed  the  relationships  of  several  aspects  of 
the  model  in  the  domain  of  engineering  (Jones,  Osborne,  et  al., 
2014;  Jones,  Tendhar,  &  Paretti,  2016).  Furthermore,  Jones,  Ruff, 
et  al.  (2015)  cited  evidence  from  studies  conducted  with  students 
in  science  and  mathematics  to  demonstrate  connections  between 
students’  class  perceptions  of  the  MUSIC  components  and  their 
identification  with  science  and  mathematics. 

It  is  important  to  note  that  the  MUSIC  model  focuses  on 
students’  perceptions  within  a  specific  learning  environment,  such 
as  a  science  class  or  a  specific  learning  task.  In  contrast,  domain 
identification  focuses  on  students’  identification  at  a  broader  do¬ 
main  level,  such  as  science. 

The  MUSIC  Model  of  Motivation 

Based  on  an  extensive  examination  of  motivation-related  re¬ 
search,  Jones  (2009,  2015,  2016a)  developed  the  multidimensional 
MUSIC®  Model  of  Motivation  to  help  teachers  identify  and  im¬ 
plement  teaching  strategies  consistent  with  current  motivation 
research.  The  MUSIC  model  helps  to  fill  a  need  for  integrative 
models  of  motivation  (Vansteenkiste  &  Mouratidis,  2016;  Wentzel 
&  Wigfield,  2009).  The  model  organizes  motivation-related  in¬ 
structional  strategies  into  five  broad  categories  and  includes  strat¬ 
egies  that:  (a)  empower  students  by  giving  them  some  control  over 
their  environment,  (b)  demonstrate  how  the  topic  is  useful  to 
students’  personal  goals,  (c)  help  students  believe  that  they  can 
succeed,  (d)  trigger  and  maintain  students’  situational  interest  in 
the  topic,  and  (e)  foster  a  sense  of  caring  and  belonging. 

Figure 1. Relationships among the variables. Variables measured in this
study are bulleted and bolded. From “Overview of the MUSIC® Model of
Motivation” by B. D. Jones, 2017, p. 7. Copyright 2017 by Brett D. Jones.
Reprinted with permission.

The MUSIC model has also been used to guide the assessment
of students’ motivation-related perceptions associated with a particular
class to examine the effectiveness of instructional approaches
(e.g., Chittum, McConnell, & Sible, in press; Jones, Chittum, et al.,
2015; Jones, Epler, Mokri, Bryant, & Paretti, 2013; Jones, Ruff,
Snyder, Petrich, & Koonce, 2012; Jones, Watson, Rakes, & Akalin,
2012; Lee, Kajfez, & Matusovich, 2013; McGinley & Jones, 2014)
and  the  relationships  between  these  perceptions  and  outcomes,  such  as 
domain  identification,  course  effort,  course  ratings,  and  career  goals 
(e.g.,  Jones,  2010;  Jones,  Osborne,  et  al.,  2014;  Jones,  Tendhar,  & 
Paretti,  2016).  Researchers  have  documented  that  students’  class 
perceptions  of  the  five  MUSIC  model  components  are  distinct,  yet 
correlated,  in  samples  of  elementary  students  (Jones  &  Sigmon, 
2016),  middle  and  high  school  students  (Parkes,  Jones,  &  Wilkins, 
2015),  and  college  students  (Jones  &  Skaggs,  2016;  Jones  &  Wilkins, 
2013).  Similar  patterns  have  been  shown  to  exist  across  cultures  and 
countries  (Jones,  Li,  &  Cruz,  2017;  Mohamed,  Soliman,  &  Jones, 
2013;  Mora,  Anorbe-Diaz,  Gonzalez-Marrero,  Martin-Gutierrez,  & 
Jones,  in  press;  Schram  &  Jones,  2016). 

We  chose  to  examine  students’  science  class  MUSIC  percep¬ 
tions  in  our  study  because  they  have  been  associated  with  domain 
identification (Jones, Osborne, et al., 2014; Jones, Ruff, et al.,
2015;  Jones,  Tendhar,  &  Paretti,  2016;  Osborne  &  Jones,  2011) 
and  for  a  few  other  reasons.  First,  the  MUSIC  model  components 
relate  to  well-known  constructs  that  have  been  studied  over  several 
decades  and  have  been  shown  to  be  associated  with  students’  class 
motivation  and  engagement  (as  explained  in  subsequent  sections). 
Second,  these  constructs  have  been  shown  to  be  changeable  by  an 
instructor  in  a  learning  environment  (Reeve,  Jang,  Carrell,  Jeon,  & 
Barch,  2004;  Turner,  Christensen,  Kackar-Cam,  Trucano,  &  Ful¬ 
mer,  2014;  Wang  &  Eccles,  2013),  which  is  important  because  the 
constructs  and  associated  instructional  strategies  would  not  other¬ 
wise  be  useful  to  instructors  who  want  to  increase  students’  mo¬ 
tivation  and  engagement  in  a  class.  Third,  we  wanted  to  assess 
enough  constructs  that  would  allow  us  to  explain  an  adequate 
amount  of  variance  in  educational  outcomes,  but  not  too  many 
constructs  that  would  result  in  the  inclusion  of  constructs  that 
overlapped  significantly  in  definition  (e.g.,  self-efficacy  and  ex¬ 
pectancy  for  success).  The  components  of  the  MUSIC  model  met 
this  criterion  because  researchers  have  documented  that  the  com¬ 
ponents  are  correlated,  yet  distinct  (Jones  et  al.,  2017;  Jones  & 
Skaggs,  2016;  Jones  &  Wilkins,  2013;  Parkes  et  al.,  2015).  In  the 
following  sections,  we  provide  further  description  of  each  of  the 
MUSIC  model  components. 

Empowerment.  The  empowerment  component  of  the  MUSIC 
model  refers  to  teaching  strategies  that  provide  students  with  the 
opportunity  to  become  autonomous  learners  by  encouraging  per¬ 
ceptions  of  choice,  freedom,  and  volition  (Jones,  2009).  By  em¬ 
powering  students,  instructors  can  meet  their  need  for  autonomy, 
which  “encompasses  people’s  strivings  to  be  agentic,  to  feel  like 
the  ‘origin’  (deCharms,  1968)  of  their  actions,  and  to  have  a  voice 
or  input  in  determining  their  own  behavior”  (Deci  &  Ryan,  1991, 
p.  243).  Empowering  students  can  meet  students’  psychological 
need  for  autonomy  and  is  consistent  with  the  tenets  of  self- 
determination  theory  (Deci  &  Ryan,  2000).  In  the  domain  of 
science,  students  who  have  been  provided  with  autonomy  have 
reported  higher  levels  of  intrinsic  motivation  (Berger  &  Hanze, 
2009),  interest  experience  (Tsai,  Kunter,  Ludtke,  Trautwein,  & 
Ryan, 2008), interest in science (Bulunuz & Jarrett, 2015; Xu,
Coats,  &  Davidson,  2012),  and  engagement  (Hafen  et  al.,  2012), 
all  of  which  can  contribute  to  students’  science  identification. 


Teachers  can  empower  students  by  giving  them  some  control 
over  their  learning  environment  through  offering  meaningful 
choices (e.g., choices of topics and team members), offering opportunities
for students to make decisions in the learning environment
(e.g., lesson pace), and welcoming students’ opinions
(Jones, 2009). In addition, it is important to communicate that
students  have  an  action  choice  or  the  ability  to  decide  to  be 
autonomous  or  to  fully  endorse  relinquishing  control  to  another 
(Reeve,  Nix,  &  Hamm,  2003). 

Usefulness.  The  usefulness  component  of  the  MUSIC  model 
includes  instructional  strategies  that  encourage  students  to  per¬ 
ceive  their  classwork  (e.g.,  assignments,  activities)  as  useful  for 
their  short-  or  long-term  goals  (Jones,  2009,  2015).  The  usefulness 
component  is  consistent  with  the  utility  value  construct  in 
expectancy-value  theory  (Eccles  et  al.,  1983;  Eccles  &  Wigfield, 
1995). As explained by Wigfield and Eccles (2000), “Utility value
or  usefulness  refers  to  how  a  task  fits  into  an  individual’s  future 
plans”  (p.  72). 

Perceptions  that  learning  tasks  are  useful  or  instrumental  in 
achieving  academic  and  personal  goals  can  positively  affect  do¬ 
main  identification  (Jones,  Osborne,  et  al.,  2014;  Jones,  Tendhar, 
&  Paretti,  2016)  and  many  constructs  closely  related  to  science 
identification,  including  interest  (Nieswandt  &  Shanahan,  2008; 
Reynolds,  Mehalik,  Lovell,  &  Schunn,  2009),  motivation  (Simons, 
Vansteenkiste, Lens, & Lacante, 2004), persistence (De Volder &
Lens, 1982; Miller, Greene, Montalvo, Ravindran, & Nichols,
1996; Simons et al., 2004), engagement (Miller et al., 1996;
Simons et al., 2004), effort (De Volder & Lens, 1982; Miller et al.,
1996;  Simons  et  al.,  2004),  and  intention  to  study  in  a  specific  field 
(Jones  et  al.,  2010).  To  support  students’  perceptions  of  usefulness 
in  an  educational  environment,  instructors  can:  design  tasks  and 
activities  that  relate  to  students’  long-term  goals;  connect  content, 
routines,  and  strategies  to  the  real  world  through  rationales  and  by 
defining  real-life  implications;  implement  experiential,  hands-on 
learning;  and  incorporate  personally  relevant  topics  (Hulleman, 
Durik,  Schweigert,  &  Harackiewicz,  2008;  Jones,  2009;  e.g., 
Jones,  Chittum,  et  al.,  2015). 

Success.  The  success  component  of  the  MUSIC  model  in¬ 
cludes  teaching  strategies  that  support  students’  perceptions  that 
they  can  succeed  if  they  put  forth  the  appropriate  effort  (Jones, 
2009,  2015).  This  component  is  consistent  with  constructs  such  as 
expectancy  for  success  (Wigfield  &  Eccles,  2000),  competence 
motivation  (Elliot  &  Dweck,  2005;  White,  1959),  the  psycholog¬ 
ical  need  for  competence  (Deci  &  Ryan,  2000,  2012),  and  self- 
efficacy  (Bandura,  1986). 

Researchers  have  related  high  expectancies  for  success  and 
competence  beliefs  to  several  constructs  associated  with  science 
identification, including intentions to persist in science (e.g., pursue
science careers, courses, and tasks; DeBacker & Nelson, 2000;
Ireson & Hallam, 2005; Rudasill & Callahan, 2010; Simpkins et
al., 2006); increased engagement (Hoffmann, 2002; Scogin &
Stuessy, 2015); higher performance (Hoffmann, 2002); increased
strategy use (Cheung, 2015); and positive affect for science (DeBacker
& Nelson, 2000; Hoffmann, 2002). In studies examining
the  relationships  between  the  MUSIC  model  components  and 
domain  identification  (e.g.,  Jones,  Osborne,  et  al.,  2014;  Jones, 
Tendhar,  &  Paretti,  2016),  researchers  have  often  equated  the 
success  component  of  the  MUSIC  model  to  the  expectancy  for 
success  construct  (Eccles  et  al.,  1983),  which  has  been  shown  to 
contribute to students’ level of domain identification. These studies
have measured expectancy for success (as opposed to self-efficacy,
competence, or self-concept) in part because students’
MUSIC perceptions have been assessed at the class level as
opposed to the task level, which would be appropriate for measuring
the self-efficacy construct (Bong & Skaalvik, 2003). Furthermore,
students’ ratings of expectancy for success have not
been  shown  to  be  empirically  distinct  from  their  ratings  of  ability, 
competence,  and  self-concept  (Eccles  &  Wigfield,  1995;  Eccles  et 
al.,  1993);  therefore,  it  is  redundant  to  assess  more  than  one  of 
these  constructs. 

Teachers  can  support  students’  success  perceptions  in  a  variety 
of  ways,  such  as  by  providing:  attainably  challenging  tasks  and 
learning  goals;  clear  and  realistic  expectations;  meaningful,  timely, 
and  constructive  feedback  that  can  be  implemented  and  is  appli¬ 
cable  to  future  learning;  opportunities  for  success  if  students  put 
forth  effort;  and  opportunities  to  practice  and  master  concepts 
(Jones,  2009).  Teachers  can  also  foster  malleable  beliefs  about 
intelligence, teach lessons considering novice versus expert understandings,
and break difficult tasks into attainable chunks to nurture
positive ability perceptions (Jones, 2009, 2015).

Interest.  The  interest  component  of  the  MUSIC  model  per¬ 
tains  to  instructional  strategies  that  stimulate  interest  in  the  aca¬ 
demic  activity,  content,  or  domain  (Jones,  2009).  Interest  can  be 
defined  as  “liking  and  willful  engagement  in  a  cognitive  activity” 
(Schraw  &  Lehman,  2001,  p.  23);  therefore,  it  includes  both  an 
affective  component  of  positive  emotion  and  a  cognitive  compo¬ 
nent  of  concentration  (Hidi  &  Renninger,  2006).  Students’  inter¬ 
ests  can  progress  along  a  continuum  in  which  triggered  situational 
interest  (which  is  short-term  and  context-specific)  can  lead  to 
well-developed  individual  interest  (which  is  more  enduring  than 
situational  interest;  Hidi  &  Renninger,  2006).  Because  the  intent  of 
the  present  study  was  to  investigate  students’  perceptions  of  their 
current  science  class,  we  focused  on  their  situational  interest  rather 
than  their  longer-term  individual  interests.  Our  rationale  was  that, 
regardless  of  students’  level  of  individual  interest  in  science, 
instructors  can  strive  to  design  instruction  that  is  situationally 
interesting  to  students.  In  addition,  situational  interest  is  a  neces¬ 
sary  condition  for  the  development  of  individual  interest  (Hidi  & 
Renninger,  2006),  which  is  similar  in  many  ways  to  domain 
identification  (see  Jones,  Ruff,  et  al.,  2015  for  a  discussion).  Thus, 
if  teachers  can  trigger  and  maintain  students’  situational  interest, 
they  may  be  able  to  develop  students’  individual  interest  and 
identification  in  the  domain. 

Situational  interest  is  consistent  with  constructs  such  as  intrinsic 
motivation (Deci, 1975), intrinsic interest value (Eccles & Wigfield,
1995), and flow (Csikszentmihalyi, 1990), and can influence
a variety of factors, including engagement, attention, persistence,
goals, strategy use, enjoyment, and performance (Hidi & Harackiewicz,
2000; Hidi & Renninger, 2006; Schraw & Lehman, 2001).
In  science,  situational  interest  has  been  associated  with  continued 
engagement  in  science  activities  (Spiegel,  McQuillan,  Halpin, 
Matuk,  &  Diamond,  2013)  and  more  motivation  to  learn  science 
(Barak,  Ashkar,  &  Dori,  2011;  Rosen,  2009).  Teachers  can  stim¬ 
ulate  situational  interest  in  many  ways,  such  as  by  inciting  curi¬ 
osity  and/or  strong  emotions,  introducing  novelty,  using  a  variety 
of  instructional  tools  and/or  tasks,  facilitating  social  interaction, 
connecting  content  to  background  knowledge  and  prior  experi¬ 
ences,  and  using  humor  (Bergin,  1999;  Hidi  et  al.,  2015). 


Caring.  The  caring  component  of  the  MUSIC  model  includes 
instructional  strategies  aimed  to  help  students  believe  that  their 
instructors  and  classmates  care  about  their  learning  and  general 
well-being  (Jones,  2009,  2015).  The  caring  component  of  the 
MUSIC model is consistent with constructs such as caring (Noddings,
1992), belonging (Baumeister & Leary, 1995), relatedness
(Deci  &  Ryan,  2000,  2012),  and  attachment  (Ainsworth,  1973; 
Bowlby,  1969).  Positive  interactions  with  instructors  and  peers  can 
positively  influence  motivation-related  outcomes  (Wentzel,  1997; 
Wentzel,  Battle,  Russell,  &  Looney,  2010).  Furthermore,  when 
students  have  healthy,  secure  attachments  with  teachers,  parents, 
and  peers,  they  are  more  likely  to  experience  an  increase  in 
academic  performance,  academic  motivation,  emotional  develop¬ 
ment,  and  social  skill  development  (Bergin  &  Bergin,  2009). 
Specifically  in  the  domain  of  science,  when  students  perceive  care, 
support,  and/or  positive  relations  with  others  in  the  learning  envi¬ 
ronment,  they  are  more  likely  to  hold  positive  attitudes  about  and 
values  for  science  (Jen,  Lee,  Chien,  Hsu,  &  Chen,  2013),  to 
develop  their  science  identity  (Lee,  2002;  Stake  &  Nickens,  2005), 
and  to  intend  to  persist  in  science  (Jacobs,  Finken,  Griffin,  & 
Wright,  1998;  Stake  &  Nickens,  2005). 

Instructors  can  encourage  positive  perceptions  of  caring  and 
feelings  of  belonging  through  their  classroom  interactions  (Jones, 
2009).  In  Wentzel’s  (1997)  study,  students  described  caring  in¬ 
structors  as  those  who  emphasized  a  democratic  style,  respected 
the  individuality  of  students,  provided  positive  and  meaningful 
feedback,  and  went  the  “extra  mile”  in  teaching  and  planning. 
Caring  can  also  be  nurtured  by  supporting  students’  educational 
goals;  demonstrating  care  and  concern  for  achieving  learning  ob¬ 
jectives,  personal  goals,  and  well-being;  carefully  designing  in¬ 
struction  to  encourage  student  learning;  providing  opportunities 
for  positive  interactions  with  peers;  and  making  oneself  available 
for  academic  support  after  hours  (Jones,  2009,  2015). 

Gender  differences.  Male  and  female  students  have  been 
shown  to  differ  on  their  perceptions  of  some  of  the  MUSIC  model 
components  (Meece,  Glienke,  &  Burg,  2006).  For  example,  fe¬ 
males  tend  to  have  lower  expectancies  for  success  about  science- 
related  proclivities  (Bong,  Lee,  &  Woo,  2015).  In  another  study, 
females  perceived  high  school  physical  science  courses  to  be  less 
useful  than  did  the  male  students,  which  led  them  to  enroll  in  fewer 
physical  science  courses  (Eccles,  2007).  Also,  classroom  experi¬ 
ences  appear  to  have  more  of  an  effect  on  female  students’  inter¬ 
ests  in  science  than  for  males.  For  example,  in  one  study  male 
students  were  more  likely  to  report  that  their  interest  was  triggered 
from  building  or  tinkering  with  mechanical  objects  or  electronics, 
or  reading  books  and  magazines  (Maltese  &  Harsh,  2015).  These 
findings  suggest  that  teachers  may  be  able  to  play  an  especially 
important  role  in  helping  female  students  improve  their  percep¬ 
tions  of  success,  interest,  and  usefulness  in  science. 

Person-Centered  Approaches 

Although  variable-centered  approaches  to  data  analysis  have 
been  effective  at  identifying  trends  in  students’  interests  and 
motivation-related  perceptions  in  science  over  time  (Simpson  & 
Oliver,  1990;  Simpkins  et  al.,  2006),  these  approaches  do  not  allow 
teachers  to  understand  how  students’  motivation  perceptions  in¬ 
teract  with  one  another  during  a  class  to  motivate  students. 
Variable-centered  approaches  investigate  the  effects  of  isolated 
variables linearly, rather than “motivational phenomena as continuously
emerging systems of dynamically interrelated components”
(Kaplan,  Katz,  &  Flum,  2014,  para.  4)  that  are  multidimensional, 
complex,  and  context-bound  (Turner  &  Meyer,  2000).  A  different 
approach  to  data  analysis,  often  named  a  person-centered  or  per¬ 
son  approach  (e.g.,  cluster  analysis),  allows  researchers  to  focus  on 
the complex nature of the individual, and motivation constructs are
studied  as  dynamic,  interactionistic  processes  (Bergman,  2001; 
Kaplan,  Katz,  &  Flum,  2012;  Vansteenkiste  &  Mouratidis,  2016). 
Hence, in person-centered approaches, the variable is not central;
rather, the person is central because the dynamic interplay
of  multiple  variables  is  studied  by  investigating  patterns  and 
relationships  among  them  (Bergman,  2001).  Consequently,  in 
cluster  analysis,  for  example,  researchers  can  examine  a  more 
integrated  profile  of  the  individual  in  which  those  with  similar 
patterns  of  relationships  among  variables  are  organized  into  a 
cluster  or  “profile”  (Bergman,  2001;  Wormington,  Corpus,  & 
Anderson,  2012). 

We  used  cluster  analysis  in  the  present  study  to  identify  patterns 
in  students’  MUSIC  perceptions  of  their  science  classes.  Several 
recent  studies  have  used  cluster  analysis  to  examine  students’ 
profiles  at  the  school  or  academic  level  (Bowers  &  Sprott,  2012; 
Hayenga  &  Corpus,  2010;  Meece  &  Holt,  1993;  Ratelle,  Guay, 
Vallerand,  Larose,  &  Senecal,  2007;  Schwinger,  Steinmayr,  & 
Spinath,  2012;  Tuominen-Soini,  Salmela-Aro,  &  Niemivirta, 
2011;  Vansteenkiste,  Sierens,  Soenens,  Luyckx,  &  Lens,  2009; 
Wormington  et  al.,  2012),  at  the  domain  level  (Chen,  2012;  Con¬ 
ley,  2012;  Hartwell  &  Kaplan,  2014;  Turner,  Thorpe,  &  Meyer, 
1998),  at  the  class  level  (Daniels  et  al.,  2008;  Shell  &  Husman, 
2008;  Shell  &  Soh,  2013),  or  at  the  task  level  (Geiser,  Lehmann, 
&  Eid,  2006;  Janssen  &  Geiser,  2010).  However,  no  studies  have 
focused  on  pre-high  school  students  in  science  classes,  which  is 
the  population  of  interest  in  the  present  study. 

Research  Questions 

To  address  the  following  research  questions,  we  surveyed  fifth-, 
sixth-,  and  seventh-grade  students  about  their  science  class  per¬ 
ceptions,  science  identification,  science  class  effort,  and  intentions 
to  persist  in  science.  RQ1:  Can  students’  science  class  perceptions 
be  used  to  categorize  students  into  groups  with  similar  motivation 
profiles?  RQ2:  If  different  profiles  can  be  identified,  are  students’ 
class-related  motivation  profiles  associated  with  their  science  iden¬ 
tification,  science  class  effort,  and  intentions  to  persist  in  science? 
RQ3:  If  different  profiles  can  be  identified,  does  membership  in 
the  profiles  vary  by  students’  gender  or  grade  level?  We  expected 
that,  at  the  minimum,  a  high  motivation  profile  and  a  low  moti¬ 
vation  profile  would  emerge.  Our  hypothesized  “high”  motivation 
profile  would  include  students  who  rated  all  of  their  science  class 
MUSIC  perceptions  highly.  The  “low”  motivation  profile  would 
include  students  who  rated  all  of  their  class  MUSIC  perceptions 
low.  To  maintain  the  exploratory  nature  of  this  investigation  and  to 
avoid  preconceptions  that  could  influence  our  analysis,  we  did  not 
hypothesize  further  about  the  profiles.  A  primary  limitation  of 
cluster  analysis  is  that  researcher  judgment  may  unduly  influence 
the cluster solution (Burns & Burns, 2008); thus, we intentionally
avoided  making  specific  hypotheses  regarding  the  cluster  solution 
so  as  to  approach  the  analysis  as  objectively  as  possible. 


We  further  hypothesized  that  students  in  a  high  motivation 
profile  would  put  forth  more  effort  in  science  class  and  report 
greater  intentions  to  persist  in  science  than  students  in  a  low 
motivation  profile.  In  addition,  we  hypothesized  that  students  in 
higher  grade  levels  would  be  in  lower  motivation  profiles  due  to 
commonly  reported  declines  in  motivation  over  time  (Eccles  et 
al.,  1993;  Jacobs  et  al.,  2002),  and  that  there  would  be  fewer 
female  students  in  higher  motivation  profiles  because  of  the 
gender  gap  associated  with  STEM  fields  (Meece  et  al.,  2006). 

The  results  of  this  study  may  be  used  to  help  educators  target 
students  with  similar  class-related  motivation  profiles,  rather  than 
adhere  to  the  difficult  and  often  unrealistic  task  of  catering  to  each 
student’s  individual  complex  needs.  Moreover,  it  may  be  possible 
to  identify  students  with  class-related  motivation  profiles  that  are 
more  or  less  likely  to  pursue  science-related  majors  or  careers. 
Using  profiles,  teachers  could  more  intentionally  target  students’ 
motivation  in  science  classrooms  and  increase  the  likelihood  that 
more  students  will  engage  in  science,  either  by  choosing  a  science- 
related  career  or  by  becoming  a  more  scientifically  literate  member 
of  society. 

Method 

Participants 

The  participants  were  students  in  grades  five,  six,  and  seven 
from  two  rural  public  schools  in  Southwest  Virginia.  We  collected 
data  at  three  time-points  and  received  responses  from  323  students 
in  2012  (84%  of  all  students  in  those  grades  at  the  schools),  320 
students  in  2013  (87%  of  all  possible  students),  and  291  students 
in  2014  (76%  of  all  possible  students).  This  sample  included  a  total 
of  934  completed  questionnaires  (some  students  completed  a  ques¬ 
tionnaire  for  two  or  three  years),  with  398  completed  question¬ 
naires  (178  students)  representing  students  assessed  at  multiple 
time  points  and  536  students  assessed  only  once.  Hereafter,  we 
refer  to  each  completed  questionnaire  as  one  “case.” 

The  majority  (90.7%)  of  the  students  identified  as  White  and  the 
others  identified  as  Black  or  African  American  (1.6%),  Hispanic 
(1.1%),  Asian  or  Pacific  Islander  (1.1%),  American  Indian  (2.6%), 
or  “other”  (2.7%),  and  two  students  chose  not  to  answer.  Slightly 
over  half  of  the  students  (52.5%)  were  female.  According  to  state 
guidelines,  both  schools  were  considered  to  comprise  a  high  pro¬ 
portion  of  low-income  students  and  qualified  for  federal  Title  I 
funds  (Virginia  Department  of  Education  [VDOE],  2012;  VDOE 
Office  of  School  Nutrition  Programs,  2014).  The  science  curricu¬ 
lum  in  the  fifth-  and  seventh-grades  comprised  earth/space,  life, 
and  physical  science  content.  In  sixth-grade,  the  science  curricu¬ 
lum  comprised  both  earth/space  science  and  physical  science.  Each 
school  contributed  approximately  half  of  the  cases  (47.7%  and 
52.3%). 

Procedures 

In  May  of  2012,  2013,  and  2014,  all  fifth-,  sixth-,  and  seventh- 
grade  students  present  at  two  K-7  schools  in  the  same  county 
completed  a  questionnaire  related  to  their  perceptions  about  sci¬ 
ence.  Students  had  been  enrolled  in  their  science  classes  since  the 
beginning  of  the  school  year  and  the  questionnaires  were  admin¬ 
istered  near  the  end  of  each  school  year.  The  schools  had  seven 
science teachers in the three grade levels, with one at each grade
level  (except  for  fifth-grade  at  one  school,  which  added  a  second 
science  teacher  in  2014).  We  obtained  Institutional  Review  Board 
approval  prior  to  conducting  the  study. 

Measures 

The  questionnaire  was  titled  generically  as  a  “Science  Question¬ 
naire”  and  was  part  of  a  larger  study  that  examined  students’ 
motivation-related  perceptions  about  their  current  science  classes, 
their  motivation  beliefs  about  science,  and  their  demographic 
information.  The  items  were  scaled  using  a  6-point  Likert-type 
format  with  the  following  descriptors:  1  =  strongly  disagree,  2  = 
disagree,  3  =  mostly  disagree,  4  =  mostly  agree,  5  =  agree,  and 
6  =  strongly  agree. 

Science  class  perceptions.  To  develop  profiles  of  students’ 
perceptions  of  their  science  class,  we  measured  each  of  the  five 
components  of  the  MUSIC  model  using  the  MUSIC®  Model  of 
Academic  Motivation  Inventory  (MUSIC  Inventory;  Jones, 
2016b).  We  used  the  middle/high  school  version  of  the  MUSIC 
Inventory  (Jones,  2016b)  that  was  designed  to  measure  middle  and 
high  school  students’  science  class  perceptions  using  the  five 
MUSIC  model  components.  Table  1  shows  the  MUSIC  model 
components,  their  definitions,  and  the  related  constructs  in  the 
MUSIC  Inventory  (Jones,  2016b).  Although  the  MUSIC  Inventory 
was  designed  specifically  to  measure  students’  class  perceptions  of 
the  five  MUSIC  model  components,  it  is  noteworthy  that  (a)  it  is 
possible  that  the  inventory  does  not  measure  the  range  of  possible 
perceptions  within  each  MUSIC  component,  and  (b)  other  instru¬ 
ments  that  measure  the  MUSIC  model  components  might  focus  on 
different  aspects  of  the  components.  Nonetheless,  the  constructs 
measured  with  the  MUSIC  Inventory  have  been  shown  to  separate 
into  distinct  constructs  using  factor  analysis  (Jones  &  Skaggs, 
2016; Parkes et al., 2015; Schram & Jones, 2016).

Table  2  includes  example  items  from  each  MUSIC  Inventory 
scale.  Cronbach’s  alpha  values  for  the  MUSIC  Inventory  have 
been  shown  to  be  acceptable  for  fifth-  to  twelfth-grade  students  in 
music and band ensemble classes (Parkes et al., 2015; empowerment
α = .73, usefulness α = .86, success α = .92, interest α = .91,
caring α = .92) and for the students in the present study (see
Table  3).  For  this  study,  we  also  used  LISREL  8.8  to  conduct  three 
confirmatory  factor  analyses  (CFAs;  one  for  each  of  the  three 
years)  and  included  22  items:  the  18  items  from  the  five  MUSIC 
Inventory  scales  and  the  four  items  from  the  science  identification 
scale. The fit indices we computed (the Comparative Fit Index [CFI],
the Standardized Root Mean Square Residual [SRMR], and the Root Mean
Square Error of Approximation [RMSEA]) were all within acceptable
limits (see Table 3; Browne & Cudeck, 1993; Byrne, 2001; Hu &
Bentler, 1999; Kline, 2005). Thus, the CFAs
documented  that  not  only  were  the  five  constructs  measured  by  the 
MUSIC  Inventory  distinct  but  also  that  these  five  constructs  were 
distinct  from  the  science  identification  construct. 

Science  identification.  We  measured  science  identification 
using a four-item Identification with Science scale initially based
on the four-item measure used by Schmader, Major, and Gramzow
(2001; α = .78) and used in the domain of engineering to measure
the extent to which engineering students identified with engineering
(e.g., α = .84 and .89 in Jones et al., 2010; α = .92 in Jones
et al., 2014). Table 2 includes a sample item. Scores from this
measure  have  been  positively  related  to  a  variety  of  students’ 
beliefs, including career goals (Jones et al., 2010, 2014; Jones,
Tendhar,  &  Paretti,  2016).  Cronbach’s  alpha  values  for  the  stu¬ 
dents  in  the  present  study  are  presented  in  Table  3. 

Science  class  effort.  We  measured  science  class  effort  with  a 
four-item  measure  used  by  Jones  (2010)  that  was  based  on  the 
Effort/Importance  Scale  that  is  part  of  the  Intrinsic  Motivation 
Inventory  (Plant  &  Ryan,  1985).  This  scale  measures  the  amount 
of  perceived  effort  that  students  put  forth  in  a  class.  See  Table  2  for 
an  example  item.  Cronbach’s  alpha  values  were  acceptable  in  the 
present study (2012 α = .87, 2013 α = .87, 2014 α = .85) and
have been shown to be acceptable in past studies (α = .84, .84, .86,
and .84 in Jones, 2010; α = .95 in Jones et al., 2014).
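
For readers less familiar with the statistic, the following is a minimal sketch of how Cronbach's alpha can be computed from item-level ratings, using the standard formula α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores). The function name `cronbach_alpha` and the ratings are hypothetical illustrations, not data or code from the present study.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of each respondent's summed scale score.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 6-point ratings from five students on a four-item scale.
ratings = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [6, 5, 6, 6],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```

Because the same variance estimator is used in the numerator and the denominator, choosing the population or the sample variance does not change the result.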

Science  course  intentions.  We  developed  one  item  to  mea¬ 
sure  students’  desire  to  enroll  in  more  science  courses  in  the 
future  (see  Table  2).  This  item  was  based  on  similar  items  used 
by Hulleman et al. (2008) to measure students’ subsequent
interest. 

Science  career  goals.  We  used  a  two-item  measure  of  science 
career  goals  that  was  based  on  similar  single  items  that  have  been 
used to measure the likelihood that students’ careers would directly
relate to engineering (e.g., Jones et al., 2014; Jones, Tendhar,
& Paretti, 2016), which serves as a measure of intent to persist
in  a  science-related  field.  A  sample  item  is  included  in  Table  2. 
These  items  have  been  associated  with  other  motivation-related 
constructs in ways consistent with theories (Jones et al., 2014;
Jones,  Tendhar,  &  Paretti,  2016).  In  this  study,  we  refer  to  both 
science  course  intentions  and  science  career  goals  as  measures  of 
students’  intentions  to  persist  in  science.  Cronbach’s  alpha  values 
were acceptable for the present study, 2012 α = .83, 2013 α = .77,
2014 α = .82.


Table 1

The MUSIC Model Components, Definitions, and Related Constructs

| MUSIC component | Definition: The degree to which a student perceives that: | Related constructs |
|---|---|---|
| Empowerment | he or she has control of his or her learning environment | autonomy |
| Usefulness | the classwork is useful to his or her future | utility value |
| Success | he or she can succeed at the classwork | expectancy for success |
| Interest | the instructional methods and classwork are interesting or enjoyable | situational interest |
| Caring | the teacher cares about whether the student succeeds in the classwork and cares about the student’s well-being | caring |

Note. Based on Jones (2016b).
Table 2

Example Items

| Construct | Example item | No. items |
|---|---|---|
| Empowerment | I have control over how I learn the content in science class. | 4 |
| Usefulness | In general, science classwork is useful to me. | 3 |
| Success | During science class, I feel that I can be successful on the classwork. | 4 |
| Interest | The science classwork is interesting to me. | 3 |
| Caring | My science teacher cares about how well I do in science class. | 4 |
| Science identification | Being good at science is an important part of who I am. | 4 |
| Science class effort | I put a lot of effort into my science class. | 4 |
| Science course intentions | I would like to take more science courses in the future. | 1 |
| Science career goals | My future career will involve science. | 2 |

Analysis 

Using  mean  scores  for  students’  science  class  perceptions  of 
each  MUSIC  model  component,  we  followed  a  two-step  clustering 
procedure  recommended  by  Bacher,  Wenzig,  and  Vogler  (2004) 
that  has  been  used  in  several  studies  (Hartwell  &  Kaplan,  2014; 
Huberty, Jordan, & Brandt, 2005; Vansteenkiste et al., 2009;
Wormington et al., 2012): (1) hierarchical agglomerative analysis
(following  Ward’s  method)  followed  by  (2)  k-means  analysis.  This 
process  includes  both  hierarchical  and  nonhierarchical  methods  to 
find  the  most  appropriate  cluster  fit,  in  which  the  second  analysis 
serves  to  more  effectively  demarcate  clusters  developed  in  the  first 
(Bacher et al., 2004). Using SPSS version 22 software, we first ran
a  hierarchical  analysis  to  determine  the  optimal  number  of  clusters 
and  preliminary  cluster  centers  (i.e.,  means  for  each  MUSIC 
model variable in each cluster; Burns & Burns, 2008) and then
k-means  analysis  for  validation  and  to  obtain  the  final  cluster 
centers  (i.e.,  means;  Mooi  &  Sarstedt,  2011).  As  an  ancillary 
purpose  of  this  paper,  we  included  a  detailed  description  of  our  use 
of  cluster  analysis  in  this  section  because  cluster  analysis  is  not  as 
widely  used  in  educational  research  as  many  other  methods. 

Hierarchical  agglomerative  cluster  analysis  is  a  process  through 
which  clusters  or  groups  form  when  individual  cases  (i.e.,  a  single 
student’s  five-dimensional  response,  which  includes  one  value  for 
each  of  the  five  MUSIC  components  measured)  are  amalgamated 
at  each  step  of  the  analysis  until,  at  the  final  step,  all  cases  combine 
into  one  large  cluster  (Bartholomew,  Steele,  Galbraith,  & 
Moustaki,  2008;  Kaufman  &  Rousseeuw,  1990).  The  researcher 
can  then  determine  at  which  stage  the  most  appropriate  number  of 
clusters  formed.  During  the  analysis,  cases  with  similar  responses 
are amalgamated. Initially, each case represents a single-case cluster,
and these single-case clusters generally form the initial cluster
centers. Then, at each step, every case or cluster is compared with
other cases or clusters,
and  pairings  are  selected  that  represent  the  least  amount  of  lost 
information  (i.e.,  the  least  sum  of  squares,  or  differences  from  the 
overall  cluster  center;  Bartholomew  et  al.,  2008).  Ward’s  method 
reduces  variance  within  clusters  by  computing  squared  Euclidean 
distances, which sum the squared differences across every variable
in the analysis during each stage (Norusis, 2011). By minimizing
the distance measures between each case and the cluster
center,  cases  that  combine  into  the  same  cluster  are  more  similar 
than  those  assigned  to  other  clusters.  When  cases  are  amalgamated 
in  this  way,  the  foremost  consideration  is  (the  inevitable)  loss  of 
information; the focus is to minimize difference/dissimilarity measures
to develop fairly homogeneous groups (Bartholomew et al.,
2008;  Norusis,  2011).  Before  computing  Ward’s  procedure,  we 
sorted  the  existing  data  randomly.  Hierarchical  analysis  can  be 
sensitive  to  order  because  the  analysis  begins  at  one  case  and 
systematically  assesses  difference  between  each  case  such  that  the 
order  of  the  cases  can  affect  the  initial  cluster  centers  and,  thus, 
how  clusters  begin  forming  (Norusis,  2011). 
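
The agglomeration process described above can be illustrated with a short sketch. This is not the SPSS procedure used in the study; it is a naive, standard-library implementation of Ward's criterion, in which the cost of merging clusters A and B is (n_A n_B / (n_A + n_B)) times the squared Euclidean distance between their centers. The function names (`centroid`, `merge_cost`, `ward`), the seed, and the six five-dimensional cases are hypothetical, and treating the running sum of merge costs as analogous to a fusion coefficient is an assumption.

```python
import random


def centroid(cluster):
    """Mean vector (cluster center) of a list of cases."""
    n = len(cluster)
    return [sum(case[d] for case in cluster) / n for d in range(len(cluster[0]))]


def merge_cost(a, b):
    """Increase in total within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward(cases, n_clusters, seed=0):
    """Merge cases step by step until n_clusters remain; also return the schedule."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)      # random case order before clustering
    clusters = [[case] for case in cases]   # every case starts as its own cluster
    schedule, total_ss = [], 0.0
    while len(clusters) > n_clusters:
        # Pick the pair whose merger loses the least information.
        cost, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        total_ss += cost
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                     # j > i, so index i is unaffected
        schedule.append((len(clusters), total_ss))
    return clusters, schedule


# Hypothetical mean scores on the five MUSIC components for six cases.
cases = [
    (5.5, 5.0, 5.8, 5.2, 5.9), (5.7, 5.2, 5.6, 5.0, 5.8), (5.4, 5.1, 5.9, 5.3, 6.0),
    (2.0, 2.5, 1.8, 2.2, 2.4), (1.8, 2.2, 2.0, 2.1, 2.6), (2.2, 2.4, 1.9, 2.0, 2.5),
]
clusters, schedule = ward(cases, n_clusters=2)
print([len(c) for c in clusters], schedule)
```

Note that a case, once merged, never leaves its cluster in this procedure. The exhaustive pairwise search is cubic in the number of cases, so the sketch is suited only to small examples.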

To  select  the  optimum  number  of  clusters  for  our  cluster  solu¬ 
tion  (i.e.,  a  stopping  point  in  the  cluster  analysis)  and  maximize 
internal  validity  (Bacher,  2002),  we  implemented  two  methods:  (a) 
we  examined  the  fusion  coefficients  provided  in  an  agglomeration 
schedule  (an  SPSS  output)  for  measures  of  change  as  clusters 
merged;  and  (b)  we  used  the  Bayesian  Information  Criterion  (BIC) 
to  determine  the  optimal  number  of  clusters  and  model  (Fraley  & 
Raftery, 1998; Nylund, Asparouhov, & Muthén, 2007) using the R
Project for Statistical Computing (R Project) software (mclust
package). 

When  a  large  decrease  in  the  fusion  coefficient  is  notable 
between  two  steps,  clusters  merged  that  caused  a  substantial 


Table 3

Cronbach’s Alpha Values and Fit Indices

| Year | n | M (α) | U (α) | S (α) | I (α) | C (α) | Identity (α) | CFI | SRMR | RMSEA |
|---|---|---|---|---|---|---|---|---|---|---|
| 2012 | 321 | .72 | .78 | .83 | .77 | .84 | .82 | .97 | .052 | .069 |
| 2013 | 308 | .72 | .83 | .77 | .76 | .79 | .83 | .96 | .058 | .076 |
| 2014 | 284 | .78 | .82 | .85 | .78 | .77 | .83 | .98 | .050 | .058 |

Note. CFI, RMSEA, and SRMR are values from CFAs that were conducted with all of the items from the MUSIC Inventory (i.e., empowerment [M], usefulness [U], success [S], interest [I], and caring [C]) and science identification (Identity) scales for each year separately.



CHITTUM  AND  JONES 


change  in  overall  within-cluster  dissimilarity.  Smaller  change  co¬ 
efficients  between  subsequent  steps  indicate  that  those  clusters 
bear  similar  heterogeneity;  thus,  merging  clusters  during  those 
stages “adds much less to distinguishing between cases” (Burns &
Burns,  2008,  p.  560).  We  designated  the  stopping  point  (i.e.,  the 
cluster  solution,  or  number  of  clusters)  when  cluster  coefficients 
indicated  a  large  change  such  that  later  steps  became  markedly 
more similar (Burns & Burns, 2008, p. 561; Norusis, 2011). When
multiple  cluster  solutions  appeared  suitable,  we  examined  each 
solution’s  cluster  centers  and  selected  the  solution  representative 
of  the  most  parsimonious  and  theoretically  meaningful  model 
(Bacher, 2002; Shell & Soh, 2013; Turner et al., 1998; Vansteenkiste
et al., 2009). Finally, we tested other potentially viable
solutions  to  determine  whether  they  rendered  more  appropriate, 
meaningful  solutions  (Shell  &  Soh,  2013),  analyzing  the  relative 
validity  of  the  cluster  solutions  (Bacher,  2002). 
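
As a rough illustration of inspecting an agglomeration schedule, the sketch below computes the percentage change in the fusion coefficient from one merge to the next. The percentage-change heuristic, the function name `coefficient_changes`, and the coefficient values are assumptions made for illustration; they are not the study's SPSS output, and the final choice of solution still rests on parsimony and theoretical meaning, as described above.

```python
def coefficient_changes(schedule):
    """Percentage change in the fusion coefficient caused by each further merge.

    schedule: list of (clusters_remaining, coefficient) pairs in merge order.
    Returns (k, pct) pairs, where pct is the change produced by merging the
    k-cluster solution down to k - 1 clusters.
    """
    changes = []
    for (k_before, coef_before), (_, coef_after) in zip(schedule, schedule[1:]):
        pct = 100.0 * (coef_after - coef_before) / coef_before
        changes.append((k_before, pct))
    return changes


# Hypothetical agglomeration schedule (last five merges only).
schedule = [(5, 1.2), (4, 2.9), (3, 5.1), (2, 48.7), (1, 130.4)]
for k, pct in coefficient_changes(schedule):
    print(f"merging {k} -> {k - 1} clusters: {pct:.0f}% increase")

# The largest relative jump marks a candidate stopping point.
candidate_k = max(coefficient_changes(schedule), key=lambda kc: kc[1])[0]
print(candidate_k)
```

In this made-up schedule, collapsing three clusters into two produces by far the largest relative jump, so a three-cluster solution would be a candidate for closer inspection.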

Next,  we  used  BIC  for  expectation-maximization  to  validate  the 
previous  analysis  and  confirm  the  number  of  clusters.  In  this 
procedure,  data  are  partitioned  through  a  blend  of  agglomerative 
hierarchical clustering procedures for Gaussian mixture models and
the expectation-maximization algorithm (Fraley & Raftery, 1998).
Then,  BIC  is  implemented  to  compare  multiple  models  and  deter¬ 
mine  the  optimal  solution.  For  this  test,  a  parameter  for  modeling 
(i.e.,  maximum  number  of  clusters/models)  is  set  in  advance.  We 
ran  the  test  twice:  once  with  100  as  the  parameter  and  a  second 
time  with  12. 
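
For reference, the criterion in its common textbook form is given below, where \hat{L} is the maximized likelihood of a candidate mixture model, p is its number of free parameters, and n is the number of cases. The notation is ours and is offered only as a reminder of the general definition, not as a description of the exact computation performed in the study.

```latex
\mathrm{BIC} = -2 \ln \hat{L} + p \ln n
```

In this form, smaller values indicate a better trade-off between fit and complexity. Some software, including (to our understanding) the mclust package, reports the sign-reversed quantity 2 \ln \hat{L} - p \ln n, in which case the model with the largest value is preferred.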

A  limitation  of  hierarchical  analysis  is  that,  once  a  case  has  been 
assigned  a  particular  cluster,  it  cannot  be  unassigned  (Asendorpf, 
Borkenau,  Ostendorf,  &  van  Aken,  2001;  Kaufman  &  Rousseeuw, 
1990).  In  other  words,  cases  cannot  move  to  different,  perhaps 
more  appropriate  clusters  later  during  the  analysis  as  the  clusters 
take  shape  and  deviate  naturally  from  the  initial  formation.  Hence, 
it  is  important  to  complete  an  additional  clustering  method  using 
the cluster solution determined with Ward’s method (Norusis,
2011). With this limitation in mind, we computed k-means cluster
analysis as a secondary test, which is considered a validation
procedure (Mooi & Sarstedt, 2011) and test of stability (Bacher,
2002). K-means cluster analysis involves selecting a predetermined
number of clusters (k), for which we used the number of clusters
defined by Ward’s method, to “fine tune” the cluster centers
(Bacher et al., 2004; Huberty et al., 2005; Wormington et al.,
2012). The k-means procedure is used as a secondary test because
hierarchical  analysis  is  needed  initially  to  determine  the  optimal 
number  of  clusters,  which  serves  as  k.  Unlike  hierarchical  cluster 
analysis,  k-means  clustering  allows  cases  to  flow  through  multiple 
iterations  such  that  cases  can  change  their  cluster  assignment  as  the 
analysis  matures;  thus,  the  resulting  cluster  centers  are  more  reli¬ 
able  and  accurate.  Iterations  begin  with  a  set  of  cluster  centers 
whereby  cases  are  classified  per  their  distance  to  that  centroid 
(Norusis,  2011).  Then,  each  cluster  center  from  the  previous  step 
is  recomputed.  Next,  cases  are  assigned  again  to  cluster  centers 
based  on  the  new  averages  and  the  aforementioned  steps  are 
repeated  until  there  is  little  change  in  the  cluster  centers  between 
steps  (Norusis,  2011).  The  final  iteration  ends  with  each  case 
assigned  to  a  permanent  cluster  and  the  final  cluster  centers  are 
computed (Norusis, 2011). To assess the reliability of the clusters,
we compared the hierarchical and k-means cluster solutions using
Cohen’s kappa¹ (κ; Reilly, Wang, & Rutherford, 2005), with a
value considered acceptable at .60 or higher when comparing
clusters  (Asendorpf  et  al.,  2001;  Vansteenkiste  et  al.,  2009). 
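
The iterative reassignment described above can be sketched as follows. This is a generic, standard-library version of the k-means iterations seeded with preliminary cluster centers, not the SPSS routine used in the study; the function names (`sq_dist`, `assign`, `kmeans`), the convergence tolerance, and the data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(cases, centers):
    """Label each case with the index of its nearest cluster center."""
    return [min(range(len(centers)), key=lambda k: sq_dist(case, centers[k]))
            for case in cases]


def kmeans(cases, initial_centers, max_iter=100, tol=1e-9):
    """Alternate assignment and center updates until the centers stop moving."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(cases, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [case for case, lab in zip(cases, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)     # keep the center of an empty cluster
        shift = max(sq_dist(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:                    # little change between steps: stop
            break
    return assign(cases, centers), centers


# Hypothetical cases and preliminary centers (e.g., from a hierarchical analysis).
cases = [
    (5.5, 5.0, 5.8, 5.2, 5.9), (5.7, 5.2, 5.6, 5.0, 5.8), (5.4, 5.1, 5.9, 5.3, 6.0),
    (2.0, 2.5, 1.8, 2.2, 2.4), (1.8, 2.2, 2.0, 2.1, 2.6), (2.2, 2.4, 1.9, 2.0, 2.5),
]
labels, centers = kmeans(cases, [[4.0] * 5, [3.0] * 5])
print(labels)
```

Unlike the hierarchical step, every case is re-evaluated on each pass, so a case can move to a different cluster as the centers shift.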

We  validated  our  cluster  solution  in  several  ways.  First,  we  ran 
the  same  analyses  with  multiple  subsets  of  the  population,  includ¬ 
ing  analyzing  the  data  from  multiple  years  and  grade  levels.  In 
addition,  we  recomputed  several  cluster  analyses  with  cases  sorted 
randomly  to  assess  stability  (Bacher,  2002).  Then,  we  used  a 
formal  double-split  cross-validation  procedure  in  which  we  split 
the subsample into two random halves and recomputed the two-step
clustering procedure, followed by a nearest neighbor analysis as a
reliability measure (Breckenridge, 2000). Then, we compared the
nearest neighbor solution to the two-step clustering solution using
Cohen’s κ, with κ > .60 considered acceptable fit for this test
(Breckenridge, 2000; Wormington et al., 2012). Solely to verify that
the clusters were statistically different (Mooi & Sarstedt, 2011)
and to produce further evidence of internal validity (Bacher,
2002),  we  examined  one-way  ANOVAs  with  the  five  clustered 
variables  as  the  dependent  variables  and  cluster  membership  as  the 
factor. 
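
To make the agreement statistic concrete, the sketch below computes Cohen's κ for two sets of cluster assignments, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance. The function name `cohens_kappa` and the labels are hypothetical, and the sketch assumes the cluster labels from the two solutions have already been matched (e.g., cluster 0 denotes the same profile in both).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two categorical labelings of the same cases."""
    n = len(labels_a)
    # Observed proportion of cases given the same label by both solutions.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from the marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical cluster assignments for ten cases from two clustering solutions.
solution_1 = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
solution_2 = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
kappa = cohens_kappa(solution_1, solution_2)
print(round(kappa, 2), kappa > .60)
```

The function is undefined when chance agreement equals 1 (for instance, when both solutions place every case in a single cluster), which a fuller implementation would need to handle.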

To  determine  appropriate  cluster  typology  while  preserving  the 
multivariate  properties  of  the  analysis,  we  computed  a  discriminant 
function  analysis  (Burns  &  Burns,  2008;  Jung,  Owusu-Antwi,  & 
An,  2006;  Weissman  &  Magill,  2008).  Discriminant  analysis  is  a 
multivariate  method  that  distinguishes  between  groups  based  on 
several  variables  (Galbraith  &  Jiaqing,  1999)  and  can  be  used  to 
characterize,  or  profile,  clusters  (Jung  et  al.,  2006).  In  essence,  the 
variables  that  contributed  most  in  distinguishing  between  groups 
are  highlighted  (Hale  &  Glassman,  1986).  We  included  the  cluster 
membership  as  the  dependent  variable  and  averages  of  each  MU¬ 
SIC  model  variable  as  the  independent  variables.  In  the  present 
study,  discriminant  analysis  served  in  a  descriptive  function  only; 
its  common  utility  as  a  function  of  probability  was  irrelevant 
(Fraley  &  Raftery,  2002;  Weissman  &  Magill,  2008). 

To  examine  RQ2,  we  tested  the  predictive  validity  of  the  clus¬ 
tering  solution  by  running  several  one-way  ANOVAs  with  theo¬ 
retically  correlated  variables,  including  science  identification,  sci¬ 
ence  class  effort,  science  course  intentions,  and  science  career 
goals.  Finally,  to  address  RQ3,  we  ran  Pearson  chi-square  tests  to 
investigate  differences  between  genders  and  grade  levels  within 
the  clusters. 
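
The two test statistics named above can be sketched with the standard library alone. The code computes the one-way ANOVA F ratio and the Pearson chi-square statistic with their degrees of freedom; p values are omitted because they require the F and chi-square distributions, which a statistics package would supply. The function names and all numbers are hypothetical and do not come from the study's data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


def chi_square_statistic(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical effort scores for three clusters, and a gender-by-cluster table.
f_value, df1, df2 = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
chi2, df = chi_square_statistic([[10, 20], [20, 10]])
print(f_value, df1, df2, chi2, df)
```

In the design described above, the groups would be the clusters and the scores an outcome such as science class effort, while the rows and columns of the contingency table would be gender (or grade level) and cluster membership.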

Results 

Table  4  includes  correlations  among  all  tested  variables  from 
2013  and  2014.  As  predicted  by  the  domain  identification  model 
(Osborne  &  Jones,  2011),  most  of  the  correlations  among  class 
perceptions,  science  identification,  science  career  goals,  science 


1 Cohen’s kappa (κ) is frequently used as a measure of interrater
reliability (von Eye & Mun, 2005). During the test, two categorical
variables are compared for each case. The total number of instances in
which both categorical variables were the same for each case in a data
set (i.e., both raters entered identical codes or categories for an
excerpt) is computed, which is also considered the level of agreement
among the raters (von Eye & Mun, 2005). An acceptable level of
agreement can be judged from κ, which ranges in value from 0.00 to
1.00, and can depend on the purpose of the test. According to Landis
and Koch (1977), greater than 0.81 is considered near perfect
agreement, 0.61 to 0.80 substantial, 0.41 to 0.60 moderate, 0.21 to
0.40 fair, 0.01 to 0.20 slight, and 0.00 poor. Similarly, Fleiss (1981)
posited that greater than 0.75 is considered excellent, 0.40 to 0.75 is
considered good, and below 0.40 is considered poor.

Table 4


Correlations and Descriptive Statistics (2013 and 2014)

Variable                 1            2            3            4            5            6            7            8            9            10
1. Grade level           —            -.183**      .006         .036         -.177**      -.133*       -.123*       -.116        -.135*       -.228**
2. Sci. identification   -.265**      —            .490**       .541**       .798**       .520**       .570**       .671**       .629**       .474**
3. Sci. career goals     -.127*       .571**       —            .657**       .387**       .323**       .644**       .325**       .399**       .179**
4. Sci. course intent    -.092        .545**       .724**       —            .469**       .364**       .524**       .389**       .492**       .187**
5. Effort                -.248**      .841**       .451**       .438**       —            .521**       .566**       .692**       .670**       .510**
6. Empowerment           -.141*       .474**       .361**       .332**       .533**       —            .581**       .482**       .611**       .376**
7. Usefulness            -.159**      .657**       .678**       .602**       .600**       .517**       —            .464**       .698**       .277**
8. Success               -.188**      .648**       .417**       .424**       .675**       .505**       .497**       —            .614**       .643**
9. Interest              -.175**      .704**       .526**       .532**       .705**       .565**       .712**       .586**       —            .444**
10. Caring               -.175**      .421**       .241**       .223**       .496**       .389**       .307**       .606**       .389**       —
2014 M (SD)              5.95 (0.88)  4.53 (1.11)  3.39 (1.57)  3.30 (1.70)  4.74 (1.07)  4.17 (1.16)  4.14 (1.33)  4.85 (1.47)  4.26 (1.27)  5.12 (0.97)
2013 M (SD)              6.06 (0.84)  4.36 (1.17)  3.23 (1.55)  3.21 (1.77)  4.61 (1.18)  4.12 (1.12)  3.96 (1.38)  4.83 (1.01)  4.20 (1.27)  4.98 (1.08)

Note. The 2014 sample is in the upper diagonal of the matrix and the 2013 sample is in the lower diagonal of the matrix. Results are available for the
2012 sample upon request. Sci. = science; Sci. course intent = science course intentions. 2014 n = 284. 2013 n = 308.
* p < .05. ** p < .01.


course intentions, and science class effort were positive and
statistically significant, and most of them were moderate to strong.
An exception was that caring was only weakly correlated with both
science course intentions and career goals. Grade level was
correlated negatively with all of the variables except for science
course intentions in the 2013 and 2014 samples, and science career
goals in the 2014 sample. This finding that students at the higher
grades reported lower science class perceptions than students at the
lower grades is consistent with previous findings (Wigfield,
Eccles, Schiefele, Roeser, & Davis-Kean, 2006). The overall
patterns of these correlations are consistent with theory and previous
research (Eccles & Wigfield, 1995; Jones et al., 2014; Jones &
Skaggs, 2016; Osborne & Jones, 2011; Wang & Eccles, 2013).

Cluster  Analyses 

The first step of our analysis involved removing univariate and
multivariate outliers. We removed cases with one or more variable
means 3 standard deviations above or below each overall variable
mean (Vijendra & Shivani, 2014), which included 18 (1.9%)
univariate outliers. Next, we ran initial k-means cluster analyses to
identify and remove cases that formed any extremely small
clusters (i.e., clusters with very few cases; Jiang, Tseng, & Su, 2001;
Kaufman & Rousseeuw, 1990), which accounted for three
multivariate outliers. We utilized this procedure because k-means
clustering is especially sensitive to outliers, often forming small N
clusters that use outlier cases as cluster centers (Kaufman &
Rousseeuw, 1990; Norusis, 2011). In all, we removed 21 outliers
(2.2%), which left a total sample of 913 cases.
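
The univariate screening rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the use of the sample standard deviation, and the example scores are hypothetical, and the authors' actual software and settings are not reported in this passage.

```python
from statistics import mean, stdev


def flag_univariate_outliers(rows, threshold=3.0):
    """Return indices of cases with any variable more than `threshold`
    sample standard deviations above or below that variable's mean.

    `rows` is a list of cases; each case is a list of variable means.
    """
    n_vars = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_vars)]
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]

    flagged = []
    for i, row in enumerate(rows):
        for value, center, spread in zip(row, centers, spreads):
            # A constant variable (spread == 0) cannot produce an outlier.
            if spread > 0 and abs(value - center) > threshold * spread:
                flagged.append(i)
                break  # one extreme variable is enough to flag the case
    return flagged


# Hypothetical scores on a 6-point scale (not data from the study).
example_rows = [[4.0, 4.0]] * 10 + [[5.0, 5.0]] * 10 + [[1.0, 4.5]]
outlier_indices = flag_univariate_outliers(example_rows)
```

In this made-up example only the last case is flagged, because its first score lies more than 3 standard deviations below that variable's mean.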

A limitation of cluster analysis is that the method may not
produce any meaningful or repeatable solutions, as clustering is
primarily an exploratory method and can depend heavily on the
structure of the sample (Bartholomew et al., 2008; Lange, Roth,
Braun, & Buhmann, 2004). A cluster solution is considered more
robust and stable when it is repeated under different circumstances
(e.g., different clustering algorithms, reordered cases, diverse
samples or subsamples; Bacher, 2002; Lange et al., 2004; Norusis,
2011). Cluster-analyzing subsamples characterized by specific
variables (e.g., grade level, year) can test whether those variables
influence the cluster solutions (Bacher, 2002). Accordingly, we
computed multiple cluster analyses in two main stages to explore
the profiles and examine their stability across subsamples. First,
we conducted two-step cluster analyses for cases at each year point
(2012 n = 321, 2013 n = 308, 2014 n = 284) in three separate
two-step analyses, which were our primary cluster analyses.
Second, to test the stability of the cluster solutions obtained in the first
stage (Bacher, 2002), we computed the two-step clustering
procedure for cases at each grade level (fifth n = 324, sixth n = 263,
seventh n = 326) across the three years in three separate two-step
analyses. We conducted the final tests only to confirm stability of
the motivation profiles already identified.

In cluster analysis, at least two observations for each variable
(2:1) with a minimum of 200 observations is considered an
acceptable ratio (Egan, 1984). Because we had 913 observations and
five variables (913:5), our sample of 913 cases was more than
adequate. In addition, our sample sizes for the year analyses (2012
n = 321, 2013 n = 308, 2014 n = 284) and grade level analyses
(fifth n = 324, sixth n = 263, seventh n = 326) were also
adequate. Similar cluster analyses in academic motivation
literature included comparable sample sizes (Vansteenkiste et al., 2009;
Hartwell & Kaplan, 2014; Shell & Soh, 2013; with sample sizes of
291 and 484 [two analyses], 139, and 233, respectively).

Stage I: Clusters per year. We selected a five-cluster
solution as the best description of the data for the three hierarchical
analyses at each year point (2012, 2013, and 2014) based on our
consideration of the fusion coefficients and our theoretical
interpretation. We reached this solution for each year independently.
Our secondary method using BIC also rendered a five-cluster
solution under both parameters (12, 100 set as the maximum
number of clusters allowed), confirming this choice. Furthermore,
the cluster centers aligned between years, as shown in Table 5 and
Figures 2 and 3. We also examined k-means analyses of three- and
four-cluster solutions, and determined that they did not provide
more meaningful or interpretable solutions. The four-cluster
solution combined cases from Clusters 2 and 4 from the five-cluster
solution. The five-cluster solution added meaning by parsing out
those students whose perceived usefulness and interest were either
somewhat more negative (Cluster 2) or positive (Cluster 4) and
assigned them to separate clusters. The three-cluster solution,


CHITTUM  AND  JONES 


Table 5

Five-Cluster Solution: Comparisons Among Years

MUSIC component  Year  Cluster 1     Cluster 2     Cluster 3     Cluster 4     Cluster 5
Empowerment      2012  2.7 sl        4.0 sh        4.0 sh        3.8 sh        5.4 h
                 2013  2.6 sl        4.2 sh        4.0 sh        4.1 sh        5.1 h
                 2014  2.3 l         3.3 sl        4.0 sh        4.0 sh        5.1 h
Usefulness       2012  2.4 l         2.5 sl        4.? sh        4.4 sh        5.5 vh
                 2013  2.3 l         2.5 sl        3.7 sh        4.3 sh        5.5 vh
                 2014  2.4 l         2.4 l         4.0 sh        3.8 sh        5.4 h
Success          2012  3.0 sl        4.6 h         4.0 sh        5.2 h         5.7 vh
                 2013  3.4 sl        4.9 h         4.1 sh        5.2 h         5.6 vh
                 2014  2.5 sl        4.6 h         3.9 sh        5.3 h         5.6 vh
Interest         2012  2.0 l         3.2 sl        3.6 sh        4.5 h         5.4 h
                 2013  2.4 l         3.0 sl        4.0 sh        4.6 h         5.5 vh
                 2014  2.1 l         2.6 sl        3.9 sh        4.4 sh        5.4 h
Caring           2012  2.8 sl        5.4 h         3.5 sh        5.4 h         5.6 vh
                 2013  3.8 sh        5.6 vh        3.7 sh        5.5 vh        5.6 vh
                 2014  3.6 sh        5.4 h         3.9 sh        5.5 vh        5.6 vh
Cluster N (%)    2012  29 (8.95%)    37 (11.42%)   56 (17.28%)   92 (28.40%)   107 (33.02%)
                 2013  45 (14.61%)   40 (12.99%)   54 (17.53%)   92 (29.87%)   77 (25.00%)
                 2014  24 (8.45%)    33 (11.62%)   48 (16.90%)   75 (26.41%)   104 (36.62%)

Note. All variables were significantly different between all clusters, p < .001. 2012 n = 321; 2013 n = 308;
2014 n = 284; N = 913.

vl = Very low (1.0 to 1.4). l = Low (1.5 to 2.4). sl = Somewhat low (2.5 to 3.4). sh = Somewhat high (3.5 to
4.4). h = High (4.5 to 5.4). vh = Very high (5.5 to 6.0).


though parsimonious, did not adequately describe the data,
providing only “high,” “middle,” and “low” clusters in which the
nuances of motivation were lost.

Initial cluster typology. Cluster solutions should also be
interpretable with names and classifications informed by theory
(Bacher, 2002). To define the cluster profiles, we first organized
the cluster centers into six categories that described the students’
reported perceptions, per the 6-point scale: very low (1.0 to 1.4),
low (1.5 to 2.4), somewhat low (2.5 to 3.4), somewhat high (3.5 to
4.4), high (4.5 to 5.4), and very high (5.5 to 6.0). We selected
terminology that describes an amount or quantity of science class
perceptions, similar to Daniels et al.’s (2008), Vansteenkiste et

Figure 2. Cluster centers for each year (2012 to 2014). This figure shows how the different clusters are stable
across  the  years.  The  five  clusters  are  differentiated  by  different  shades  and  marker  styles:  black  with  “X” 
marker  =  Cluster  5;  dark  gray  with  square  marker  =  Cluster  4;  light  gray  with  diamond  marker  =  Cluster  3; 
black with triangle marker = Cluster 2; dark gray with circle marker = Cluster 1. Years are demarcated with
different lines: solid = 2012; large dashes = 2013; small dashes = 2014.

Figure  3.  Collapsed  cluster  centers  based  on  year.  This  figure  simplifies  the  visual  of  cluster  centers.  Each  line 
represents  the  mean  of  the  three  years  (2012,  2013,  2014)  for  the  five  MUSIC  model  components,  per  cluster. 
The  five  clusters  are  differentiated  by  different  shades  and  line  styles:  black  solid  line  =  Cluster  5;  dark  gray 
solid  line  =  Cluster  4;  light  gray  solid  line  =  Cluster  3;  black  dashed  line  =  Cluster  2;  dark  gray  dashed  line  = 
Cluster  1. 


al.’s (2009), Wormington et al.’s (2012), and Hayenga and
Corpus’s (2010) conception of low or high “quantity” motivation. The
items in the MUSIC Inventory measure the quantity of students’
perceptions about science class rather than the quality of those
perceptions.
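
The six descriptive bands can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and rounding a cluster center to one decimal place before classifying it is an assumption made here so that every value on the 6-point scale falls into exactly one band.

```python
def perception_category(center):
    """Map a cluster center on the 6-point scale to its descriptive band."""
    value = round(center, 1)  # assumption: bands are defined to one decimal
    if not 1.0 <= value <= 6.0:
        raise ValueError("cluster centers must lie between 1.0 and 6.0")

    # (upper bound of band, label), taken from the ranges stated in the text.
    bands = [
        (1.4, "very low"),
        (2.4, "low"),
        (3.4, "somewhat low"),
        (4.4, "somewhat high"),
        (5.4, "high"),
        (6.0, "very high"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
```

For instance, a center of 2.7 falls in the "somewhat low" band and a center of 5.5 in the "very high" band, matching the labels attached to those values in Table 5.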

Using the very low to very high categories to explain each
variable’s cluster center within the overall cluster membership, our
initial characterization of the five clusters was as follows: (a) low
motivation; (b) low usefulness and interest, moderate
empowerment, and high success and caring; (c) somewhat high motivation;
(d) somewhat high empowerment, usefulness, and interest, and
high success and caring; and (e) high motivation. Findings for
these analyses are displayed in Table 5. The average percentage of
students per cluster was similar across years, with the majority of
students (55% to 63%) assigned to Clusters 4 and 5. One-way
ANOVAs indicated that all five clusters were significantly
different on the clustered variables (p < .001), which is expected
because the purpose of cluster analysis is to maximize
within-cluster homogeneity and between-cluster heterogeneity.
Intercorrelations were also low (-.006 to .110), supporting this
conclusion.

Discriminant analysis. To further distinguish each motivation
profile, we computed a discriminant factor analysis (Burns &
Burns, 2008) with the 2014 dataset (n = 284). We used the 2014
dataset as an exemplar because the motivation profiles were
relatively similar across years. Four functions emerged, which was
expected because the maximum number of functions possible is
the number of clusters minus one (Burns & Burns, 2008). Table 6
includes structure coefficients for the four functions. Function 1
(D1) is the dominant function, as it explained 83% of the
between-groups variance and, together, D1 and D2 explained 98.5%. D3 and
D4 were responsible for a negligible amount of explained variance
(1.6% combined) and were not key factors in cluster membership
for any of the five profiles. Accordingly, and with our descriptive
intention in mind, we omitted D3 and D4 from these results to
maintain a parsimonious model.

Table 6

Discriminant Function Analysis, 2014

Function  Eigenvalue  % of variance  Canonical correlation  χ2     df  Interpretation
D1        7.165       83.1           .937                   853.7  20  High interest, success, and usefulness
D2        1.324       15.4           .755                   269.9  12  High caring and success, low usefulness
D3        0.082        1.0           .275                    35.5   6  —
D4        0.050        0.6           .219                    13.6   2  —

We interpreted D1 and D2 using the discriminant loadings,
which are Pearson correlations between the functions and MUSIC
model variables, and indicate which variables were most important
or influential within each function (Burns & Burns, 2008). D1 is
associated with a high level of interest, success, and usefulness (in
this order). Empowerment and caring were considered less critical
factors. D2 is primarily associated with a high level of perceived
caring (the most important predictor), as well as with high
perceived success and low usefulness, which were similarly weighted.
Neither empowerment nor interest was significant in this model.

Table 7 shows the discriminant functions at each cluster
centroid. The discriminant function coefficient at Cluster 1 indicates
very low interest, success, and usefulness (-5.78), indicating that
students’ very low perceptions of these MUSIC components were
prominent influences of membership in the low motivation profile.
Empowerment was not a meaningful factor in the Cluster 1
membership and associations with all other functions were low. Cluster
2 membership suggests low interest, success, and usefulness

Table 7

Unstandardized Canonical Discriminant Functions at Cluster
Centroid, 2014

Cluster  D1 (High interest,        D2 (High caring and       D3      D4
         success, and usefulness)  success, and low
                                   usefulness)
1        -5.783                    -0.881                     0.029   0.519
2        -2.817                     2.087                     0.436  -0.195
3        -1.614                    -1.733                    -0.033  -0.331
4         0.268                     0.855                    -0.422  -0.004
5         2.780                    -0.276                     0.175   0.098

Note. Bold font indicates the most important factors to cluster
membership.


(-2.82) and, at the same time, high caring and success, and low
usefulness (2.09). Cluster 2 is consistent with a profile with high
caring, moderate success, and very low interest and usefulness. We
indicated a moderate level of success because the success variable
was contradictory between functions, with low success in D1 and
high success in D2, and the canonical discriminant function
coefficients were similarly weighted. Empowerment was not a
meaningful factor in the Cluster 2 membership and influences of other
functions were low. Cluster 3 was moderately low on the high
interest, success, and usefulness factor (-1.61), and
moderately low on the high caring and success, and low usefulness
factor (-1.73). Thus, Cluster 3 suggests that these students held
fairly moderate to somewhat high perceptions of all variables, and
that empowerment was not an influential factor in the Cluster 3
membership. Cluster 4 indicates that no single variable or factor
was especially influential in cluster membership in that no function
was particularly significant; rather, the similar correlation
coefficients suggest a combination of several influential variables.
Finally, students in Cluster 5 indicated high interest, success, and
usefulness (2.78), which was more important to their cluster
membership. These findings are the inverse of Cluster 1, the “low
motivation” profile, in which extremely low perceptions of
interest, success, and usefulness held weight.
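
The percentages of between-groups variance and the canonical correlations reported in Table 6 follow directly from the eigenvalues. The sketch below recomputes them as a consistency check; the function and variable names are illustrative, and the relationship used for the canonical correlation, sqrt(eigenvalue / (1 + eigenvalue)), is the standard one for discriminant functions rather than something stated in the article.

```python
from math import sqrt

# Eigenvalues as reported in Table 6 (2014 data).
TABLE_6_EIGENVALUES = {"D1": 7.165, "D2": 1.324, "D3": 0.082, "D4": 0.050}


def variance_shares(eigenvalues):
    """Percentage of between-groups variance explained by each function."""
    total = sum(eigenvalues.values())
    return {name: 100.0 * value / total for name, value in eigenvalues.items()}


def canonical_correlation(eigenvalue):
    """Canonical correlation implied by a discriminant function's eigenvalue."""
    return sqrt(eigenvalue / (1.0 + eigenvalue))


shares = variance_shares(TABLE_6_EIGENVALUES)
first_two_combined = shares["D1"] + shares["D2"]
correlations = {
    name: canonical_correlation(value)
    for name, value in TABLE_6_EIGENVALUES.items()
}
```

Under these assumptions the recomputed shares for D1 and D2 round to the reported 83.1% and 15.4% (98.5% combined), and the implied canonical correlations for D1 and D2 round to the reported .937 and .755.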

Final cluster typology. Combining results of the discriminant
analysis with our earlier categorization based on the cluster centers
of each MUSIC model variable, we developed the following labels
to describe each cluster: (a) low motivation; (b) low usefulness and
interest, but high success and caring; (c) somewhat high
motivation; (d) somewhat high motivation, and high success and caring;
and (e) high motivation.

Stage II: Stability tests. To test the stability of the clusters
when organized into different subsets, we followed the same
two-step clustering procedure for separate grade levels (fifth n =
324, sixth n = 263, seventh n = 326) rather than years. These
stability tests were intended to reduce the teacher effect and effects
of unknown and contextual variables. We found that the cluster
solution and cluster centers remained stable. The five-cluster
solution best fit each grade level and the cluster centers aligned with
the 2012, 2013, and 2014 clusters. See Table 8 for the cluster
centers by grade level and Figures 4 and 5 for visual
representations of the five clusters and their stability at each grade level.

Gender and grade level associations. Given previous
research citing gender and age effects on science motivation (Bong
et al., 2015; Maltese & Tai, 2010; Meece et al., 2006), we
investigated gender and grade level differences among motivation
profiles. Pearson chi-square tests revealed significant differences
between genders, χ2(4, N = 913) = 13.45, p = .001 and grade levels,
χ2(8, N = 913) = 48.01, p < .001 (see Figures 6 and 7). A higher


Table 8

Five-Cluster Solution: Comparisons Among Grade Levels

MUSIC component  Grade  Cluster 1     Cluster 2     Cluster 3     Cluster 4     Cluster 5
Empowerment      5th    2.2 l         3.5 sh        4.0 sh        4.2 sh        5.4 h
                 6th    2.8 sl        3.8 sh        4.1 sh        4.0 sh        5.1 h
                 7th    2.8 sl        2.9 sl        4.0 sh        3.9 sh        5.1 h
Usefulness       5th    2.5 sl        2.3 l         3.9 sh        4.5 h         5.6 vh
                 6th    2.3 l         2.5 sl        3.9 sh        4.2 sh        5.4 h
                 7th    2.9 sl        1.8 l         4.0 sh        3.8 sh        5.3 h
Success          5th    2.5 sl        4.6 h         4.2 sh        5.3 h         5.7 vh
                 6th    3.0 sl        4.7 h         3.5 sh        5.0 h         5.7 vh
                 7th    3.0 sl        4.2 sh        4.3 sh        5.2 h         5.6 vh
Interest         5th    2.3 l         3.0 sl        3.9 sh        4.8 h         5.5 vh
                 6th    2.1 l         3.1 sl        4.0 sh        4.3 sh        5.5 vh
                 7th    2.3 l         2.2 l         3.9 sh        4.2 sh        5.3 h
Caring           5th    3.0 sl        5.6 vh        3.9 sh        5.6 vh        5.8 vh
                 6th    3.5 sh        5.3 h         3.6 sh        5.3 h         5.5 vh
                 7th    2.7 sl        5.2 h         3.6 sh        5.5 vh        5.5 vh
Cluster N (%)    5th    15 (4.62%)    42 (12.92%)   50 (15.38%)   113 (34.56%)  105 (32.31%)
                 6th    30 (11.36%)   40 (15.15%)   32 (12.12%)   75 (28.41%)   87 (32.95%)
                 7th    42 (12.84%)   34 (10.40%)   66 (20.18%)   94 (28.75%)   91 (27.83%)

Note. All variables were significantly different between all clusters, p < .001. Fifth-grade n = 324; sixth-grade
n = 263; seventh-grade n = 326; N = 913.

vl = Very low (1.0 to 1.4). l = Low (1.5 to 2.4). sl = Somewhat low (2.5 to 3.4). sh = Somewhat high (3.5 to
4.4). h = High (4.5 to 5.4). vh = Very high (5.5 to 6.0).

Figure 4. Cluster centers for each grade level (fifth, sixth, seventh). This figure shows how the different
clusters  are  stable  across  the  grade  levels.  The  five  clusters  are  differentiated  by  different  shades  and  marker 
styles:  black  with  “X”  marker  =  Cluster  5;  dark  gray  with  square  marker  =  Cluster  4;  light  gray  with  diamond 
marker  =  Cluster  3;  black  with  triangle  marker  =  Cluster  2;  dark  gray  with  circle  marker  =  Cluster  1.  Grade 
levels  are  demarcated  with  different  lines:  solid  =  fifth-grade;  large  dashes  =  sixth-grade;  small  dashes  = 
seventh-grade. 


proportion of female students were assigned to Clusters 4 and 5,
and a lower proportion were in Cluster 3. Inversely, males were
overrepresented in Cluster 3 and underrepresented in Cluster 5.
Fifth-grade students were overrepresented in Cluster 5 and
underrepresented in Clusters 1 and 3. Inversely, seventh-grade students
were overrepresented in Clusters 1 and 3, and underrepresented in
Cluster 5. Sixth-grade students were underrepresented in Cluster 5.
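
As a minimal sketch of the Pearson chi-square test of independence used here, the function below computes the statistic and its degrees of freedom from a table of observed counts. The function name and the small example table are hypothetical; the cell counts behind the reported tests are not given in the text, so no attempt is made to reproduce the reported statistics.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts, e.g. one row per gender
    and one column per cluster.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column membership were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 counts, for illustration only.
example_statistic, example_df = pearson_chi_square([[10, 20], [20, 10]])
```

The degrees of freedom match the reported tests: two genders by five clusters gives (2 - 1)(5 - 1) = 4, and three grade levels by five clusters gives (3 - 1)(5 - 1) = 8.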

Students’  Cluster  Membership  Across  Years 

To test whether or not students remained in the same clusters
every year, we selected the 167 students who completed the
questionnaire at more than one time point (i.e., at 2012 and 2013,
and/or 2012 and 2014, and/or 2013 and 2014), and we ran a series
of Cohen’s κ tests to compare their cluster memberships between
years. We found that students’ cluster memberships varied across
years: between 2012 and 2013, κ = .191 (n = 149); between 2012
and 2014, κ = .135 (n = 65); and between 2013 and 2014, κ =
.290 (n = 47). We were able to test a total of 261 comparisons
between two separate years (2012/2013, 2012/2014, 2013/2014)
because some of the 167 students were assessed all three years.
Examined another way, of the students for whom we had data
spanning more than one year, 37.1% did not move to a new cluster,


Figure 5. Collapsed cluster centers based on year. This figure simplifies the visual of cluster centers. Each line
represents the mean of the three years (2012, 2013, and 2014) for the five MUSIC model components, per
cluster. The five clusters are differentiated by different shades and line styles: black solid line = Cluster 5; dark
gray solid line = Cluster 4; light gray solid line = Cluster 3; black dashed line = Cluster 2; dark gray dashed
line = Cluster 1.

Figure 6. Gender distribution in clusters. This figure includes the percentage of each gender out of the full
sample  that  was  categorized  into  each  cluster.  Female  n  =  485;  male  n  =  428. 


40.9% moved to a lower cluster number (e.g., Cluster 3 to Cluster
1), and only 21.8% moved to a higher cluster number (e.g., Cluster
3 to 5) over time. These findings suggest that cluster membership
may be somewhat dependent on the context of each specific
science class, science teacher, or some other variable or
combination of variables.
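
For readers who want to see how the agreement statistic works, the sketch below computes Cohen's kappa for one student's cluster labels in two years and maps a kappa value onto the Landis and Koch (1977) bands quoted in Footnote 1. The function names and the rounding to two decimals before banding are assumptions made for illustration; this is not the authors' code.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of category labels."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("both sequences must be non-empty and equally long")
    n = len(first)

    # Observed proportion of cases with identical labels.
    observed = sum(a == b for a, b in zip(first, second)) / n

    # Agreement expected by chance from the marginal label frequencies.
    first_counts = Counter(first)
    second_counts = Counter(second)
    expected = sum(
        first_counts[label] * second_counts[label] for label in first_counts
    ) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Band a kappa value using the cut points quoted in Footnote 1."""
    value = round(kappa, 2)  # assumption: bands are defined to two decimals
    if value >= 0.81:
        return "near perfect"
    if value >= 0.61:
        return "substantial"
    if value >= 0.41:
        return "moderate"
    if value >= 0.21:
        return "fair"
    if value >= 0.01:
        return "slight"
    return "poor"
```

Applied to the reported values, these bands would place κ = .191 and κ = .135 in the slight range and κ = .290 in the fair range, which is consistent with the conclusion that cluster membership varied across years.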


Predictive  Validity 

To provide more evidence for the predictive validity of our
cluster solution (Bacher, 2002), we completed several follow-up
tests on the 2014 data with variables that have been shown to
correlate theoretically and empirically: (a) identification for
science, (b) science career goals, (c) science course intentions, and (d)
science class effort. We selected the 2014 data (n = 284) to act as
an exemplar, as the motivation profiles and ANOVA results were
similar across years. By selecting a single year, we sought to
reduce complexity and present the information in a coherent
manner. One-way ANOVAs revealed significant differences
among motivation profiles in science class and each outcome
variable (see Table 9). We computed Tukey’s HSD and Games-
Howell post hoc tests for all variables, which we describe next
(Table 10 and Figure 8).

Figure 7. Grade level distribution among clusters. Because the number of students in each grade level was
uneven, this figure shows the percentage of students in the grade level that was grouped into each cluster. Grade
5 n = 324; Grade 6 n = 263; Grade 7 n = 326.

Science identification. Post hoc tests revealed significantly
higher reported science identification in Cluster 5 than in all other
clusters. Cluster 4 included significantly higher reported science
identification than Clusters 3, 2, and 1. Clusters 3 and 2 included
statistically similar reported science identification, which was
significantly higher than in Cluster 1. Cluster 1 had the lowest
reported science identification.

Science  career  goals.  Post  hoc  tests  indicated  significantly 
higher  reported  science  career  goals  in  Cluster  5  than  in  all  other 
clusters.  Clusters  4,  3,  and  2  included  statistically  similar  reported 
career  goals,  as  did  Clusters  2  and  1,  albeit  lower.  Cluster  1  had  the 
lowest  reported  science  career  goals,  which  were  significantly 
lower  than  all  other  clusters  except  Cluster  2. 

Science  course  intentions.  Post  hoc  tests  revealed  that  Cluster 
5  included  significantly  higher  reported  intentions  to  take  science 
courses  in  the  future  than  all  other  clusters.  Clusters  4  and  3 
included  statistically  similar  reported  course-related  intentions,  as 
did  Clusters  1  and  2;  however,  Clusters  1  and  2  were  significantly 
lower  than  all  other  clusters. 

Science class effort. Post hoc tests revealed that Cluster 5
included higher reported science class effort than all other clusters,
and Cluster 4 was higher than Clusters 3, 2, and 1. Clusters 3 and
2 included statistically similar reported effort in science class, and
Cluster 1 had significantly lower reported effort than all other
clusters.
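
The F ratios in Table 9 can be recovered from the reported sums of squares and degrees of freedom, which makes for a quick arithmetic check. The sketch below is illustrative: the function name and the dictionary layout are assumptions, while the input numbers are copied from Table 9.

```python
def f_ratio(ss_between, df_between, ss_within, df_within):
    """Return (MS between, MS within, F) for a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Sums of squares from Table 9 (2014 data). With five clusters and
# n = 284, the degrees of freedom are 5 - 1 = 4 and 284 - 5 = 279.
TABLE_9_SUMS_OF_SQUARES = {
    "Science identification": (156.764, 191.707),
    "Science career goals": (176.891, 516.865),
    "Science course intentions": (225.828, 589.327),
    "Science class effort": (151.744, 173.857),
}

recomputed_f = {
    outcome: f_ratio(ss_b, 4, ss_w, 279)[2]
    for outcome, (ss_b, ss_w) in TABLE_9_SUMS_OF_SQUARES.items()
}
```

Each recomputed value agrees with the F reported in Table 9 to within rounding (for example, 39.191 / 0.687 gives approximately 57.04 for science identification).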

Discussion 

Our objective was to use a person-centered approach to
categorize students into multidimensional motivation profiles in science
class based on five well-known motivation constructs that have
been shown to relate to students’ science identification and
intentions to persist in science. Our use of person-centered analyses
allowed us to identify complex patterns and offer a more inclusive
view of students’ motivation than more commonly used linear
research methods (Meece & Holt, 1993). To our knowledge,
researchers have not yet used cluster analysis to examine the
motivation of pre-high school science students in this manner.
Our study demonstrates that patterns of science class perceptions
form five stable clusters, which are theoretically meaningful and
depict the students’ multidimensional class-related motivation
profiles.

We describe the five profiles in more detail next, including their
associations with other measures (as shown in Figure 8), to provide
evidence of predictive validity. We classified each cluster
according to the quantity of motivation reported (i.e., cluster centers and
descriptive statistics) and the factors that were most important to
cluster membership, per the discriminant analysis.

Profile  Descriptions 

Cluster  1:  Low  motivation  profile.  Students  in  Cluster  1 
reported  a  lower  quantity  of  motivation  for  science  class  than 
students  in  the  other  clusters.  The  motivation  profile  for  these 
students  is  primarily  characterized  by  very  low  perceived  interest, 
success,  and  usefulness.  Although  less  influential  in  determining 
cluster  membership,  they  also  perceived  low  empowerment  in 
science  class  and  felt  that  their  teachers  were  only  moderately 
caring. 

Of  the  five  clusters,  the  students  in  Cluster  1  reported  that  they 
identified  the  least  with  science  and  applied  the  least  amount  of 
effort to their science classes. Likewise, they reported little
intention to persist in science, either by taking future science courses or
considering  the  pursuit  of  a  science-related  career.  These  findings 
are  consistent  with  previous  research  positing  that  students  with 
low  motivation  tend  to  put  forth  little  effort,  have  low  expectancy 
for  success,  and  lack  value  for  the  subject  (Legault,  Green-Demers, 
&  Pelletier,  2006). 

Cluster  2:  Low  usefulness  and  interest,  but  high  success  and 
caring  profile.  Students  in  Cluster  2  reported  that  they  held  low 
usefulness  and  interest  and  had  little  control  over  their  learning  in 
science  class;  however,  they  expected  to  be  successful  and  felt 
cared  for  in  the  classroom.  This  profile  is  primarily  characterized 


Table 9

One-Way ANOVA Results for Outcome Variables

Variable / Source             M (SD)        SS        df    MS       F
Science identification        4.52 (1.11)
  Between groups                            156.764     4   39.191   57.036*
  Within groups                             191.707   279    0.687
  Total                                     348.471   283
Science career goals          3.39 (1.57)
  Between groups                            176.891     4   44.223   23.871*
  Within groups                             516.865   279    1.853
  Total                                     693.756   283
Science course intentions     3.30 (1.70)
  Between groups                            225.828     4   56.457   26.728*
  Within groups                             589.327   279    2.112
  Total                                     815.155   283
Science class effort          4.74 (1.16)
  Between groups                            151.744     4   37.936   60.878*
  Within groups                             173.857   279    0.623
  Total                                     325.601   283

* p < .001.
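
The F ratios in Table 9 follow from the reported sums of squares and degrees of freedom: each mean square is SS divided by df, and F is the between-groups mean square divided by the within-groups mean square. The following is a minimal sketch of that arithmetic, not the original analysis code; the dictionary and function names are hypothetical and chosen only for illustration.

```python
# Sums of squares as reported in Table 9 (between groups, within groups).
TABLE_9_SS = {
    "Science identification": (156.764, 191.707),
    "Science career goals": (176.891, 516.865),
    "Science course intentions": (225.828, 589.327),
    "Science class effort": (151.744, 173.857),
}

DF_BETWEEN = 4    # five clusters minus one
DF_WITHIN = 279   # total df of 283 minus the between-groups df of 4


def f_ratio(ss_between, ss_within, df_between=DF_BETWEEN, df_within=DF_WITHIN):
    """Return (MS between, MS within, F) for a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


for variable, (ss_b, ss_w) in TABLE_9_SS.items():
    ms_b, ms_w, f_value = f_ratio(ss_b, ss_w)
    print(f"{variable}: MS_between={ms_b:.3f}, MS_within={ms_w:.3f}, F={f_value:.3f}")
```

Recomputing the ratios this way is a quick consistency check on a reconstructed table: the results should agree with the tabled mean squares and F values up to rounding.
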

Table 10

Means and Standard Deviations per Cluster for Each Outcome Variable

Variable                     Cluster 1     Cluster 2     Cluster 3     Cluster 4     Cluster 5
                             M (SD)        M (SD)        M (SD)        M (SD)        M (SD)
Science identification       2.88 (0.90)   3.94 (1.08)   3.93 (0.84)   4.63 (0.75)   5.30 (0.68)
Science career goals         2.06 (1.46)   2.53 (1.32)   3.19 (1.30)   2.99 (1.20)   4.35 (1.48)
Science course intentions    1.88 (1.26)   2.06 (1.44)   3.00 (1.43)   3.01 (1.39)   4.36 (1.55)
Science class effort         3.12 (1.17)   4.11 (0.88)   4.18 (0.89)   4.82 (0.80)   5.50 (0.54)

by  high  perceived  caring,  moderate  success  expectancies,  and  very 
low  situational  interest  and  perceived  usefulness. 

Students  in  Cluster  2  reported  that  they  moderately  identified 
with  science  and  put  forth  a  moderate  amount  of  effort  in  science 
class,  similar  to  Cluster  3.  However,  their  desire  to  take  more 
science  classes  and/or  to  pursue  science-related  careers  was  low. 
When  students  do  not  intend  to  persist,  they  tend  to  report  lower 
values  like  usefulness  and  interest  (Meece,  Wigfield,  &  Eccles, 
1990;  Schunk  &  Pajares,  2005),  and  less  autonomy  (Deci  &  Ryan, 
2000;  Schunk,  1995),  such  as  these  students  reported  for  their 
science  classes.  Overall,  these  students  may  not  have  perceived 
science to be useful or experienced much enjoyment and situational
interest in science class, and felt little empowerment with
respect to their learning, but they exerted some effort in class,
expected  to  do  fairly  well,  and  believed  that  their  teachers  cared 
about  their  academic  and  personal  well-being.  Even  though  these 
students  did  not  have  many  intentions  to  persist  in  science,  it  is 
possible that their moderate effort in class could have been motivated
by other factors, such as their expectations for success (Cox
&  Whaley,  2004;  Gendolla,  Wright,  &  Richter,  2012;  Greene, 
DeBacker,  &  Krows,  1999;  Pajares,  1996),  high  perceptions  of 
teacher  caring  (Wentzel,  1997),  and/or  more  external  factors  we 
did  not  measure,  such  as  attaining  a  specific  grade. 


Cluster  3:  Somewhat  high  motivation  profile.  Students  in 
Cluster  3  reported  that  they  perceived  only  somewhat  high  levels 
of  empowerment,  usefulness,  success,  interest,  and  caring  in  their 
science  classes.  We  propose  that  a  lack  of  highly  positive  caring 
and  success  beliefs  differentiates  this  profile  from  Clusters  2  and  4. 

Students  in  Cluster  3  reported  moderate  science  identification 
and  science  class  effort,  similar  to  Cluster  2;  however,  unlike 
Cluster  2,  they  reported  only  somewhat  low  intentions  to  persist  in 
science.  Overall,  these  data  suggest  that  somewhat  high  motivation 
for science class in this case seems to be associated with similarly
moderate  to  moderately  low  science  identification,  intentions  to 
persist  in  science,  and  effort  put  forth  in  class. 

Cluster  4:  Somewhat  high  motivation,  and  high  success  and 
caring  profile.  Cluster  4  reported  a  somewhat  high  amount  of 
empowerment,  moderate  to  high  perceptions  of  usefulness  and 
interest,  and  that  they  expected  to  be  very  successful  and  perceived 
a high level of caring in science class. However, Cluster 4 expressed
somewhat low science career goals and science course
intentions.  Regardless  of  their  low  intentions  to  persist  in  the  field, 
they  valued  science  as  an  important  part  of  who  they  were  and, 
likewise,  put  forth  a  lot  of  effort  in  science  class.  Overall,  these 
students  held  only  moderate  perceptions  of  usefulness  and  interest, 
and  it  seems,  corresponding  to  these  beliefs,  they  had  few  plans  to 


[Figure 8 image not reproduced. Vertical axis: 1 to 6. Sections: Clusters 1 through 5. Legend entries recovered: science career goals, science course intentions, science class effort.]

Figure 8. Cluster comparisons across correlated variables (2014 data set). Each bar represents one variable and
each section represents one cluster.

pursue science courses and careers; nevertheless, they placed importance
on doing well in science, tried hard, expected to be
successful,  and  believed  that  their  teachers  cared. 

We  more  closely  examined  dissimilarities  between  Clusters  2 
and  4,  and  Clusters  3  and  4,  respectively.  In  general,  it  appears  that 
the  key  distinctions  between  Clusters  2  and  4  were  lower  perceived 
usefulness  and  interest  in  science  class  in  Cluster  2,  which  were 
negatively  associated  with  reported  effort  and  identification.  This 
is  evidenced  in  several  ways.  First,  mean  values  for  each  science 
class perception indicate that the profiles are similar except that in
Cluster  4,  perceived  usefulness  and  interest  in  science  class  were 
much  higher,  and  success  expectancies  appeared  only  slightly 
higher.  Also,  perceived  empowerment  in  science  class  and  science 
career  goals  were  similar  between  profiles;  however,  those  in 
Cluster 2 reportedly exerted significantly less effort in science
class,  and  their  science  identification  and  intentions  to  take  future 
science  courses  were  notably  lower. 

Cluster 4 is primarily different from Cluster 3 in that perceived
caring  and  success  expectancies  in  science  class  were  higher  in 
Cluster  4.  Differences  in  usefulness  and  interest  appear  more 
minor,  although  Cluster  4  was  generally  more  interested  during 
science  class.  Cluster  4  reported  similar  intentions  to  persist  as 
Cluster  3;  however,  they  reportedly  put  forth  much  more  effort  in 
science class and indicated significantly higher science identification.
Thus, these data may imply a positive relationship between
high caring and success expectancies, and students’ effort in science
class and domain identification.

Cluster  5:  High  motivation  profile.  Students  in  Cluster  5 
reported a higher quantity of motivation for science class than students
in  the  other  clusters.  Discriminant  analysis  revealed  that  their 
motivation  for  science  class  was  primarily  characterized  by  high 
situational interest, success expectancies, and perceived usefulness.
Although less influential in determining cluster membership,
they  also  generally  perceived  that  they  had  control  over  their 
learning  in  science  class  and  that  their  teachers  were  highly  caring. 
Students  in  this  cluster  reported  significantly  greater  perceptions 
on  all  four  outcome  variables:  they  believed  that  science  was  an 
important  part  of  their  identities,  they  put  forth  a  lot  of  effort  in 
science  class,  they  reported  that  they  wanted  to  take  science  classes 
in  the  future,  and  they  could  see  themselves  in  science-related 
careers.  These  findings  are  consistent  with  previous  research, 
which  suggests  that  students  with  high  motivation  are  often  more 
engaged  in  class,  put  forth  more  effort,  hold  positive  motivation- 
related  beliefs,  and  intend  to  persist  in  the  future  (Deci  &  Ryan, 
2000;  Hidi  &  Renninger,  2006;  Wigfield  &  Eccles,  2000).  It  is 
important to note that Clusters 1 through 4 reported low to somewhat
low intentions to persist; of all clusters, only Cluster 5
reported  somewhat  high  intentions  to  persist. 

Trends  Across  Profiles 

We  identified  several  important  trends  that  can  help  to  further 
unravel  students’  motivations.  First,  we  found  that  situational 
interest  and  usefulness  tended  to  be  lower  than  perceptions  of 
success  and  caring  in  science  class  for  most  profiles.  Only  students 
in  the  high  motivation  profile  reported  high  situational  interest  and 
usefulness  for  science  class.  These  patterns  may  indicate  that, 
although  students  in  Clusters  2  and  4  believe  that  they  can  be 
successful  and  that  their  teachers  are  caring,  they  may  not  have  any 


particular  desire  to  engage  in  science  activities,  unlike  those  in 
Cluster  5  (e.g.,  Eccles,  Wigfield,  &  Schiefele,  1998).  These  results 
may  also  imply  that  situational  interest  and  usefulness  are  not  as 
critical  as  perceptions  of  success  in  terms  of  the  effort  students  put 
forth  in  class  and  their  identification  with  science,  as  multiple 
clusters  were  associated  with  somewhat  high  to  high  reports  of 
effort  in  science  class  and/or  science  identification.  This  finding 
aligns  with  literature  that  associates  success  expectancies  and 
effort (Cox & Whaley, 2004; Gendolla et al., 2012; Greene et al.,
1999;  Pajares,  1996),  and  success  expectancies  and  identification 
(Osborne  &  Jones,  2011).  Only  the  cluster  with  high  perceived 
situational  interest  and  usefulness  in  science  class  (Cluster  5)  also 
reported  somewhat  high  intentions  to  persist,  which  is  consistent 
with  previous  findings  that  perceptions  of  usefulness  are  positively 
associated with persistence (Jones et al., 2010; Jones, Tendhar, &
Paretti, 2016; Meece et al., 1990). Furthermore, lack of situational
interest  and  perceived  usefulness  has  been  related  to  amotivation, 
attrition, and other negative outcomes (Ryan & Deci, 2000; Legault
et al., 2006; Renninger, Hidi, & Krapp, 1992), which appears
consistent with the low to somewhat low intentions to persist in
science  reported  by  all  clusters  except  Cluster  5.  As  in  Cluster  5, 
there  is  evidence  indicating  that  a  combination  of  high  situational 
interest, success, and usefulness is positively associated with persistence,
effort, course selection, and choice of college major
(Aschbacher, Ing, & Tsai, 2014; Eccles et al., 1983), outcomes
especially  salient  in  the  present  study.  We  propose  that  students  in 
Cluster  5  are  more  likely  to  persist  in  science  and  perform  well  in 
the  future,  and  their  reported  intentions  seem  to  suggest  the  same. 
We do not present evidence to explain whether perceived situational
interest and usefulness in science class were lower in Clusters
1 through 4 or whether perceived success and caring were
simply  higher  (i.e.,  students  perceived  science  class  as  easier  and 
believed  that  their  teachers  were  caring).  Additional  research  is 
needed  to  better  understand  the  implications  of  these  findings. 

Second, we also found noteworthy trends concerning empowerment.
A high level of perceived empowerment in science class
was  noted  only  in  Cluster  5.  This  finding  is  consistent  with 
self-determination  theory  (Deci  &  Ryan,  2000),  which  states  that 
motivation  associated  with  more  positive  outcomes  is  considered 
more autonomous (i.e., internalized motivation and intrinsic motivation;
Deci & Ryan, 2012; Ryan, 1995) and higher quality
(Vansteenkiste  et  al.,  2006).  Prior  investigations  indicated  that 
students  who  are  given  some  autonomy  and  hold  high  situational 
interest  and  usefulness  for  the  task — as  in  Cluster  5 — are  more 
likely  to  function  positively  (e.g.,  increased  engagement;  Assor, 
Kaplan,  &  Roth,  2002),  and  that  provision  of  fewer  choices  has 
predicted  decreased  interest  and  usefulness  perceptions  (e.g., 
Midgley  &  Feldlaufer,  1987).  Furthermore,  the  measure  we  used 
for situational interest in science class is analogous to some measures
of intrinsic motivation (e.g., Reeve, 1989), implying that
students  in  Cluster  5  both  perceived  more  empowerment  in  science 
class  and  were  more  autonomously  motivated  than  in  all  other 
profiles.  However,  empowerment  did  not  present  as  an  influential 
variable  in  discriminant  analysis,  indicating  that  it  may  not  be  as 
important  as  the  other  variables  in  categorizing  students  into  a 
particular  profile.  In  other  words,  perceived  autonomy  was  not  the 
organizing factor of clusters in this study, which seems to conflict
with  some  arguments  that  meeting  the  need  for  autonomy  is 
particularly  important  to  high  quality  motivation  (Deci  &  Ryan, 
1985, 2012). Still, we measured quantity of motivation with the
MUSIC  Inventory  (i.e.,  the  amount  of  control  and  freedom  the 
students  felt  they  experienced  in  science  class)  rather  than  quality. 
The particular significance of autonomy in self-determination theory
centers on the concept of higher quality motivation (Vansteenkiste
et al., 2006), which is considered more autonomous or, in
other  words,  volitional,  valued,  and  endorsed  by  the  sense  of  self 
(Deci & Ryan, 2012; Vansteenkiste et al., 2006). Students identified
as part of Cluster 5, which would be the profile most indicative
of a high quality motivation profile, reported the most perceived
empowerment in class. These students not only reported a
high amount of perceived control in their science classes, unlike
all other profiles, but also reported the highest perceived situational
interest, expectancies for success, and usefulness in science
class,  as  well  as  the  highest  science  identification,  effort  in  science 
class,  and  intentions  to  persist  in  the  field.  These  findings  require 
further  investigation,  as  we  did  not  measure  the  direction  of  the 
relationships  in  the  present  study  nor  do  we  attempt  to  assert  causal 
relationships. 

Third,  although  post  hoc  assessment  of  differences  among  the 
MUSIC  components  across  profiles  is  considered  an  inappropriate 
test  given  the  nature  of  cluster  analysis,  a  distinction  in  the  caring 
perceptions  across  profiles  is  visibly  discernible.  There  appear  to 
be  two  fairly  consistent  “groupings,”  per  se,  for  caring:  one  that 
indicates  high  to  very  high  perceived  caring  in  science  class  (found 
in  Clusters  2,  4,  and  5)  and  a  second  indicating  somewhat  low  to 
somewhat  high  perceived  caring  in  science  class  (Clusters  1  and 
3). Each of the five clusters fell into one of these two groups,
indicating  that  students  either  tended  to  think  their  science  teachers 
very  caring  or  somewhat  caring,  with  few  in  between.  It  is  beyond 
the  scope  of  the  present  study  to  further  evaluate  the  cause  of  this 
trend; however, it appears to suggest that students in this population
may be prone to somewhat dichotomous perceptions of
teacher  caring. 

Gender  Differences 

Disproportionately more female students were assigned to Clusters
4 and 5 than to Clusters 1, 2, and 3. This finding is consistent
with  the  results  of  previous  cluster  analyses  that  indicated  a  higher 
proportion  of  female  students  in  similarly  high  quantity  motivation 
profiles  (Ratelle  et  al.,  2007;  Wormington  et  al.,  2012)  and  some 
investigations  of  secondary  level  students  more  generally  (Fischer, 
Schult,  &  Hell,  2013).  Our  findings  (see  Figure  7)  indicate  that 
female  students  were  generally  more  motivated  in  science  classes 
than  their  male  counterparts,  which  also  contradicts  some  research 
we  cited  previously  that  suggests  male  students  are  often  more 
motivated  in  science  courses  (Bong  et  al.,  2015;  Eccles,  2007; 
Maltese  &  Harsh,  2015;  Meece  et  al.,  2006). 

Grade  Level  Differences 

Fifth-grade  students  were  overrepresented  in  Cluster  5  and 
underrepresented  in  Clusters  1  and  3  (see  Figure  6).  This  finding 
generally  suggests  that  fifth-grade  students  were  more  highly 
motivated  than  the  older  students.  In  a  direct  inverse  of  this 
finding,  seventh-grade  students  were  underrepresented  in  Cluster 
5,  and  overrepresented  in  Clusters  1  and  3,  suggesting  that  the 
oldest  students  were  less  motivated  for  science  class  than  their 


younger  peers.  Similarly,  there  were  fewer  than  expected  sixth- 
grade  students  in  the  high  motivation  cluster.  Together,  these 
findings are consistent with previous studies indicating that motivation
in science often declines with age (Eccles et al., 1993;
Jacobs  et  al.,  2002). 

Context-Dependent  Motivation 

Of  those  students  who  completed  the  survey  at  multiple  time 
points,  few  retained  the  same  profile  during  two  or  more  years.  It 
is important to note that the questionnaire items specifically targeted
perceptions of the students’ current science classes. Thus,
this  finding  supports  the  notion  that  motivation-related  perceptions 
often  depend  on  the  specific  context  of  each  class,  which  previous 
research suggests can be influenced by teachers and the educational
environment (e.g., Dotterer & Lowe, 2011; Neiswandt &
Shanahan, 2008; Steinmayr & Spinath, 2008; Urdan & Schoenfelder,
2006; Wang & Eccles, 2013). If so, these findings may
underscore  the  importance  of  teachers’  behaviors  and  instructional 
design  in  affecting  students’  motivation  in  science  classes. 

We  posit  that  the  changes  in  students’  motivation  profiles  are 
not  attributable  to  science  curriculum  differences  per  grade  level 
because  the  curriculum  at  the  schools  in  this  study  encompassed 
two  to  three  science  disciplines  per  grade  level,  and  physical 
science — sometimes  considered  the  most  rigorous  and  difficult — 
was  taught  in  all  three  grade  levels.  In  all,  this  is  significant 
information,  as  it  may  encourage  educators  to  view  students’ 
motivation  as  phenomena  that  can  change  from  year-to-year  and 
possibly  class-to-class,  rather  than  remain  fixed  and  unchangeable. 
However,  more  research  is  needed  to  support  this  notion,  as  other 
factors  may  have  contributed  to  the  transient  cluster  membership 
(e.g.,  home  life,  age/grade  level,  environmental  influences  on 
testing  days). 

We  further  suggest  that  the  changes  in  students’  profiles  may  be 
related,  in  part,  to  the  typical  waning  of  students'  motivation  over 
time,  given  that  40.9%  moved  to  a  lower  cluster  number,  37.1% 
did  not  move  to  a  new  cluster,  and  only  21.8%  moved  to  a  higher 
cluster  number.  If  students’  science  motivation  typically  decreases 
over  time  (Osborne  et  al.,  2003;  Simpson  &  Oliver,  1990),  the 
nature  of  these  profiles  would  be  such  that  we  would  expect 
students  to  move  from  higher  cluster  numbers  to  lower  cluster 
numbers  between  years.  We  could  expect  this  because,  compared 
with  the  higher  cluster  numbers,  students  in  the  lower  cluster 
numbers  reported  lower  or  similar  science  class  perceptions  (with 
the  exception  that  the  Cluster  2  scores  for  perceived  success  and 
caring  are  higher  than  the  Cluster  3  scores),  as  shown  in  Figure  3. 
These  yearly  changes  highlight  the  importance  of  teachers  and 
schools  in  targeting  strategies  consistent  with  the  components  of 
the MUSIC model in hopes that these strategies will alter the
documented  decline  in  motivation  for  the  sciences,  which  may 
serve  to  help  assuage  our  national  debt  in  science  professionals. 
For example, a recent study of an afterschool science and engineering
program that incorporated elements of the MUSIC model
indicated  that  those  students  who  participated  in  two  phases  of  the 
extracurricular  program  maintained  their  motivation  for  science 
over  time,  whereas  their  peers  tended  to  follow  the  expected 
decline  (Chittum  et  al.,  under  review). 
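
The three percentages reported above partition students by the direction of their profile change between survey years. A minimal sketch of that tally follows, assuming each student's cluster numbers for two years are available as a pair; the function name and the sample pairs are hypothetical and are not the study's data.

```python
from collections import Counter


def transition_shares(pairs):
    """Percentage of students moving to a lower, the same, or a higher cluster number.

    `pairs` is an iterable of (earlier_cluster, later_cluster) integers.
    """
    counts = Counter()
    for earlier, later in pairs:
        if later < earlier:
            counts["lower"] += 1
        elif later > earlier:
            counts["higher"] += 1
        else:
            counts["same"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one (earlier, later) pair is required")
    return {key: 100.0 * counts[key] / total for key in ("lower", "same", "higher")}


# Hypothetical cluster assignments for four students across two years.
example_pairs = [(5, 4), (3, 3), (2, 4), (4, 1)]
print(transition_shares(example_pairs))
```

Because the cluster numbers are only roughly ordered by quantity of motivation (Cluster 2 exceeds Cluster 3 on perceived success and caring), a tally like this describes movement between profile labels rather than a strict decline or gain in motivation.
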

Relationships to Science Identification

Our findings provide evidence to support some of the relationships
within the domain identification model (Osborne & Jones,
2011).  As  predicted  by  theoretical  and  empirical  evidence  (Jones, 
Ruff, et al., 2015; Jones et al., 2014; Osborne & Jones, 2011),
students’ science class perceptions of the MUSIC model components
were statistically related to their science identification; and,
in addition, science identification was statistically related to science
effort, course intentions, and career goals. The fact that the
higher-numbered  motivation  profiles  were  more  strongly  related  to 
higher  levels  of  science  identification  than  the  lower-numbered 
motivation  profiles  provides  further  evidence  that  these  variables 
are  positively  correlated. 

Limitations 

This  study  serves  as  a  proof  of  concept  and,  thus,  is  the  first 
step in investigating class-related motivation profiles of pre-high
school students in science considering science class perceptions
related to empowerment, usefulness, success, interest,
and  caring.  More  research  will  be  needed  to  examine  teacher 
effects,  motivation  profiles  in  different  contexts  (e.g.,  domains, 
grade  levels,  schools),  effects  of  designing  interventions  for 
motivation  profiles,  and  further  understanding  the  implications 
of  this  person-centered  approach.  We  are  unsure  how  much  of 
the  students’  perceptions  about  science  class  can  be  attributed 
to  personal  traits  and  how  much  can  be  attributed  to  the  class — 
another  area  for  future  research.  Also,  collecting  qualitative 
data  may  help  researchers  to  better  interpret  and  explain  these 
profiles. 

In general, studies that use self-report measures are inherently
subject to validity threats due to inaccurate assessment of
personal  beliefs,  misunderstandings,  and  examining  perceptions 
that  may  be  novel  to  participants  and  thus  lack  prior  thought.  As 
all  students  were  enrolled  in  two  schools  in  one  rural  area,  the 
results  of  this  study  may  not  be  generalizable  to  other  students 
who  vary  significantly  from  the  participants  in  this  study.  In 
cluster analysis especially, which is exploratory in nature, cluster
membership and structure can vary depending on the context;
therefore, we caution against overgeneralizing across samples,
environments, and domains (Vansteenkiste et al., 2009).
Similarly,  the  MUSIC  model  summarizes  important  teaching 
strategies  such  that  they  can  be  communicated  to,  and  utilized 
by,  practicing  educators.  As  in  all  situations  in  which  concepts 
are summarized, important information can be lost; thus, distinctions
among the strategies and theories from which the
MUSIC  model  stemmed  may  be  lost  or  made  somewhat  unclear. 
As  noted  previously,  it  is  also  possible  that  our  measures  of  the 
MUSIC model components did not assess the range of perceptions
that are possible within each component. Nonetheless, we
believe  that  this  study  represents  an  important  contribution  in 
understanding  how  multidimensional  motivation  perceptions 
affect  students’  outcomes  in  science  classes. 

Conclusion 

We  identified  five  different  class-related  motivation  profiles  that 
illustrate complex patterns in students’ perceptions about science


class  and  appear  to  manifest  differently  among  individuals.  The 
five  profiles  formed  consistent  patterns  in  students’  self-reported 
effort  in  science  class,  and  aligned  as  expected  with  theoretically 
and  empirically  correlated  variables  such  as  science  identification, 
course  intentions,  and  career  goals.  These  findings  indicate  that 
students’  perceptions  of  the  MUSIC  model  components  in  science 
class  and  their  levels  of  science  identification  are  important  to 
consider  when  trying  to  motivate  students  to  both  engage  in  their 
current  science  class  and  consider  a  science-related  career  in  the 
future. 

The  fact  that  students’  perceptions  varied  across  the  five 
MUSIC  model  components  contributes  to  the  literature  in  a  few 
ways.  First,  it  demonstrates  that  motivation  in  science  class  is  a 
multidimensional  construct  that  comprises  different  facets  that 
can  be  assessed  quickly  with  a  paper-and-pencil  questionnaire. 
Theoretically,  this  raises  questions  about  how  these  science 
class  perceptions  (i.e.,  the  MUSIC  model  components),  drawn 
from  multiple  motivation  theories,  can  be  integrated  into  a  more 
comprehensive  theory  of  students’  motivation  in  class.  Further 
research  is  needed  to  better  understand  how  the  five  MUSIC 
model components work together to influence students’ motivation
in a class and how these perceptions then affect their
domain identification and goals. Practically, because the MUSIC
model components are important to students’ motivation,
teachers  should  assess  these  components  (either  formally  or 
informally)  and  then  identify  strategies  to  help  students  develop 
positive  perceptions  of  their  classes.  The  profiles  identified  in 
this  study  may  help  educators  to  intentionally  design  instruction 
for  students  with  similar  class-related  motivation  profiles, 
rather  than  adhere  to  the  difficult  and  often  unrealistic  task  of 
targeting each student’s individual complex needs. Furthermore,
these findings underscore the importance of the teacher’s
instructional decisions in impacting students’ motivation-related
perceptions. Thus, using these profiles, teachers may be
able  to  more  purposefully  affect  students’  motivation  in  science 
classrooms  and  increase  the  likelihood  that  more  students  will 
engage  in  science  class  and  consider  science-related  careers. 

This study explores various measures of the ethnic makeup in a classroom and their relationship with student
outcomes.  We  examine  whether  measures  of  ethnic  diversity  are  related  to  achievement  (mathematics, 
reading)  and  feeling  of  belonging  with  one’s  peers  over  and  above  commonly  investigated  composition 
characteristics.  Multilevel  analyses  were  based  on  data  from  a  representative  sample  of  18,762  elementary 
school  students  in  903  classrooms.  The  proportion  of  minority  students  and  diversity  measures  showed 
negative associations with student outcomes in separate models. When diversity measures and the
proportion of minority students were included in the same model, diversity of minority students mostly lost its significance. However, the results
suggest that diversity measures may provide additional information over and above other classroom characteristics
for some student outcomes. The various measures of diversity led to comparable results.


Educational  Impact  and  Implications  Statement 

This  study  suggests  that  the  ethnic  makeup  of  classrooms  is  related  to  student  outcomes.  That  is, 
students  in  classes  with  a  higher  proportion  of  ethnic  minority  students  showed  slightly  lower 
achievement and feeling of belonging with their peers even if the socioeconomic status, the
immigrant background of the family, cognitive ability, and gender of the students are equal. In addition
to  the  proportion  of  ethnic  minority  students,  average  socioeconomic  status,  and  average  cognitive 
abilities,  we  looked  at  the  ethnic  heterogeneity  in  each  classroom  and  found  that  this  was  mostly 
independent from student outcomes. Only for math did we find a positive association indicating that
students  in  a  more  ethnically  diverse  classroom  showed  slightly  higher  test  scores — however,  this 
slight  association  cannot  be  interpreted  as  a  causal  relationship  because  of  our  cross-sectional  design. 
The findings suggest that measures of heterogeneity may uncover relationships that the mere
proportion of minority students, which disregards various ethnic groups in the classroom, is unable to
show, and they open a discussion on how to investigate effects of ethnic diversity in educational research.

As societies become more diverse in terms of ethnic background,
the composition of the student body within educational
systems  diversifies  as  well.  This  ethnic  makeup  of  schools  and 


classrooms  can  be  described  by  two  different  characteristics  that 
are  associated  with  each  other:  the  proportion  of  minority  students 
and  ethnic  heterogeneity  or  diversity.  Both  characteristics  have 
been used in research to describe school and classroom settings
and  to  investigate  the  relationships  with  student  achievement.  In 
particular, research focusing on associations with student outcomes
assumes that the composition of the student body shapes the
learning  environment  and  therefore  also  the  outcome  of  student 
learning. 
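
To make the distinction between the two characteristics concrete, the sketch below computes both for two hypothetical classrooms. The heterogeneity measure shown, one minus the sum of squared group shares (a Simpson-type index), is an assumption chosen for illustration; the text here does not specify which diversity measures the study uses, and the group labels and function names are invented.

```python
from collections import Counter


def proportion_minority(groups, majority_label="majority"):
    """Share of students whose group label differs from the majority label."""
    return sum(label != majority_label for label in groups) / len(groups)


def simpson_diversity(groups):
    """One minus the sum of squared group shares (assumed illustrative index).

    0 means every student belongs to the same group; larger values mean
    students are spread more evenly across more groups.
    """
    n = len(groups)
    return 1.0 - sum((count / n) ** 2 for count in Counter(groups).values())


# Two hypothetical classrooms with the same proportion of minority students.
classroom_a = ["majority"] * 5 + ["A"] * 5
classroom_b = ["majority"] * 5 + ["A", "A", "B", "B", "C"]

for name, room in (("A", classroom_a), ("B", classroom_b)):
    print(name, proportion_minority(room), round(simpson_diversity(room), 2))
```

Both classrooms have half of their students in minority groups, yet the second is more heterogeneous because its minority students come from several groups. This is the kind of difference that the proportion of minority students alone cannot register.
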

Various  theoretical  accounts  assume  a  negative  relationship 
between  the  proportion  of  ethnic  minority  students  and  student 
achievement based on school resources and mediated by instructional
quality, language spoken with peers, and learning culture
(Driessen,  2002;  Goldsmith,  2011;  Raudenbush,  Fotiu,  &  Cheong, 
1998;  Stipek,  2004).  In  addition,  several  authors  assume  positive 
effects  of  ethnic  heterogeneity  on  achievement  because  students  in 
heterogeneous  learning  environments  encounter  and  have  to  work 
through  contradictions  and  discrepancies  in  everyday  life  and 
therefore  may  be  able  to  expand  their  intellectual  capacities  (e.g., 
Benner  &  Crosnoe,  2011;  Gurin,  Dey,  Gurin,  &  Hurtado,  2003; 
Peetsma,  Van  der  Veen,  Koopman,  &  Van  Schooten,  2006;  Tam  & 
Bassett,  2004). 

Research  exploring  the  relationship  between  the  ethnic  makeup 
of  schools  or  classrooms  and  student  achievement  shows  mixed 
results:  The  proportion  of  ethnic  minority  students  in  a  school  or 
classroom  often  has  no  or  slightly  negative  predictive  effects  on 
student  achievement  (Mickelson,  Bottia,  &  Lambert,  2013;  Van 
Ewijk  &  Sleegers,  2010a).  For  ethnic  heterogeneity,  some  studies 
report  that  a  higher  proportion  of  ethnically  heterogeneous  stu¬ 
dents  may  lead  to  higher  achievement  (e.g.,  Benner  &  Crosnoe, 
2011;  Tam  &  Bassett,  2004). 

Most  studies  have  dealt  only  with  a  broad  distinction  between 
ethnic  minority  and  majority  without  addressing  and  measuring 
ethnic  heterogeneity  or  diversity,  yet  their  authors  sometimes 
interpret  the  results  in  the  light  of  diversity.  The  present  study 
compares  various  measures  of  ethnic  composition  and  heteroge¬ 
neity  used  in  different  disciplines  with  the  goal  of  better  under¬ 
standing  the  relationship  between  the  ethnic  makeup  of  classrooms 
and  student  outcomes.  Our  aim  is  to  investigate  whether  the 
measures  of  ethnic  diversity  are  related  to  student  achievement  and 
psychosocial  outcomes  over  and  above  commonly  investigated 
characteristics  of  classroom  composition.  That  is,  we  want  to  find 
out  whether  the  proportion  of  minority  students  is  sufficient  to 
describe  effects  of  the  ethnic  makeup  or  whether  diversity  mea¬ 
sures  can  provide  additional  information. 

Relationship  Between  Characteristics  of  the  Student 
Body  and  Individual  Student  Outcomes 

Students  differ  in  their  educational  success  and  level  of  achieve¬ 
ment  outcomes.  This  variability  is  associated  with  individual  back¬ 
ground  characteristics,  such  as  cognitive  abilities,  prior  knowl¬ 
edge,  and  the  socioeconomic  background  of  their  families  and 
associated  home  learning  environment  (e.g.,  OECD,  2010).  In 
addition  to  these  individual  and  family  characteristics,  the  compo¬ 
sition  of  the  student  body  matters  for  individual  outcomes.  Al¬ 
though  some  authors  state  that  compositional  effects  may  reflect 
mere  methodological  artifacts  (e.g.,  Hauser,  1970),  current  re¬ 
search  concludes,  for  instance,  that  students  tend  to  show  higher 
achievement  in  classrooms  that  are  characterized  by  a  high  average 
prior  achievement  level  and  a  high  average  socioeconomic  status 
(SES) of the student body (Van Ewijk & Sleegers, 2010b). However,
research is inconclusive on whether the ethnic composition is
related  to  student  outcomes  (e.g.,  Driessen,  2002),  independent  of 
the  average  prior  achievement  and  average  SES  of  the  classroom. 

The  present  article  focuses  on  the  relationship  between  the 
ethnic  makeup  of  classrooms  and  students’  achievement,  as  well  as 
psychosocial  outcomes.  More  precisely,  the  term  ethnic  makeup 
may  pertain  to  two  characteristics  that  represent  different  strands 
of  theory  and  research:  (a)  the  proportion  of  ethnic  minority 
students  as  the  measure  of  ethnic  composition  commonly  used  in 
educational  research  and  (b)  ethnic  heterogeneity  measured  by 
various  indices  analogous  to  the  concept  of  diversity  operationa¬ 
lized  in  a  large  number  of  different  disciplines. 

Definitions:  Ethnic  Composition  and  Heterogeneity 

Educational  research  that  addresses  questions  of  the  ethnic 
makeup  of  classrooms  commonly  operationalizes  the  ethnic  com¬ 
position  by  calculating  the  proportion  of  ethnic  minority  students 
in  a  classroom  or  school.  For  instance,  international  meta-analyses 
on  ethnic  composition  and  student  achievement  with  about  38 
primary  studies  consistently  distinguish  between  ethnic  minority 
and  majority  students  or  single  minority  groups  and  majority 
students  (Mickelson  et  al.,  2013;  Van  Ewijk  &  Sleegers,  2010a).  A 
compositional  effect  here  reflects  the  effect  of  the  proportion  of 
minority  students  even  after  controlling  for  the  effect  of  the  indi¬ 
vidual  minority  background  (see  Raudenbush  &  Bryk,  2002).  This 
approach to diversity is sometimes referred to as the “simplistic
majority-minority approach” (Budescu & Budescu, 2012), because
it draws only a superficial picture of the actual ethnic
classroom  composition.  Related  to  the  ethnic  composition  is  the 
idea  of  heterogeneity:  It  is  often  implicitly  assumed  that  a  high 
proportion  of  ethnic  minority  students  represents  a  heterogeneous 
student  body.  However,  the  majority-minority  distinction  does  not 
provide  much  information  on  heterogeneity  because  it  disregards 
the  distribution  of  various  ethnic  groups. 

The  concept  of  heterogeneity  or  diversity  plays  a  key  role  in  a  large 
number  of  disciplines,  such  as  ecology  (e.g.,  McCann,  2000),  eco¬ 
nomics  (e.g.,  Hall  &  Tideman,  1967),  organizational  psychology  (e.g., 
Hoppe,  Fujishiro,  &  Heaney,  2014;  Meyer,  in  press),  communication 
(e.g.,  Dimmick  &  McDonald,  2001),  and  geography  (e.g.,  Les  & 
Maher,  1998).  Their  operationalizations  of  diversity  can  be  used  in 
educational  research  as  well.  Yet,  they  are  less  common  in  this  field 
(as  an  example,  see  Benner  &  Crosnoe,  2011).  In  the  current  study, 
diversity  is  understood  as  “the  distribution  of  population  elements 
along  a  continuum  of  homogeneity  to  heterogeneity  with  respect  to 
one  or  more  variables”  (Lieberson,  1969,  p.  851  cited  in  Budescu  & 
Budescu,  2012;  cf.  Teachman,  1980).  This  concept  is  also  referred  to 
as  “variety  diversity”  describing  differences  in  group  compositions 
according  to  a  categorical  variable  (see  Harrison  &  Klein,  2007,  for  a 
classification of diversity concepts). In contrast to the compositional
effects mentioned previously, diversity is purely a classroom-level
characteristic. Existing operationalizations capture the information of
diversity  with  a  measure  that  represents  a  single,  dual,  or  threefold 
concept  (see  Junge,  1994;  McDonald  &  Dimmick,  2003;  Stirling, 
2007).  That  is,  the  measure  includes  one  or  more  of  the  following 
pieces  of  information:  number  of  categories,  distribution  of  elements 
across  categories,  and  a  numerical  distance  measure  that  expresses 
how  similar  the  various  categories  are  to  each  other.  These  pieces  of 
information are combined using relative frequencies (e.g., Simpson’s D)
or logarithms of those frequencies (e.g., Shannon’s H). The present
study applies single and dual concept measures that address only the
number of categories (i.e., the number of ethnicities present in a
classroom) or additionally the distribution of elements across categories
(i.e., how many students of each ethnicity there are in a classroom, as
captured, for instance, by Simpson’s D and Shannon’s H).1
These diversity indices thus provide more information on heterogeneity
than the majority-minority distinction because they reflect the
multitude of ethnic backgrounds. A detailed overview of the most common
diversity measures can be found in Table S1, available
as online supplemental material.2
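
To make the difference between the majority-minority distinction and dual
concept measures concrete, the following minimal Python sketch computes the
number of categories, Simpson’s D, and Shannon’s H for two hypothetical
classrooms. It assumes the commonly used forms D = 1 - sum(p_i^2) and
H = -sum(p_i * ln p_i), where p_i is the share of students in ethnic group
i. The function names and the classroom data are illustrative and are not
taken from the study.

```python
import math
from collections import Counter


def proportions(labels):
    """Return the relative frequency of each category in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def n_categories(labels):
    """Single concept measure: number of distinct groups present."""
    return len(set(labels))


def simpson_d(labels):
    """Simpson's D, assumed here as 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions(labels))


def shannon_h(labels):
    """Shannon's H, assumed here as -sum(p_i * ln(p_i))."""
    return -sum(p * math.log(p) for p in proportions(labels))


def minority_share(labels):
    """Proportion of students not labeled 'majority'."""
    return sum(label != "majority" for label in labels) / len(labels)


# Hypothetical classrooms (not study data): both have 50% minority students.
class_a = ["majority"] * 10 + ["group_1"] * 10
class_b = ["majority"] * 10 + [
    f"group_{i}" for i in range(1, 6) for _ in range(2)
]

for name, room in (("A", class_a), ("B", class_b)):
    print(name, minority_share(room), n_categories(room),
          round(simpson_d(room), 3), round(shannon_h(room), 3))
```

In this sketch both classrooms have the same proportion of minority
students (0.5), yet classroom B contains more ethnic groups and has higher
values of D and H than classroom A. This is the kind of information that
the proportion alone cannot show.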

These  measures  and  their  comparison  have  received  relatively  little 
attention  in  educational  research  thus  far.  Reviews  in  other  fields 
showed  that  some  measures  are  more  sensitive  to  the  number  of 
categories  (e.g.,  Junge’s  H)  or  to  changes  in  the  largest  proportion  of 
categories (e.g., Simpson’s D) than other measures, and that the
measures agree closely when used to quantify diversity
(McDonald & Dimmick, 2003). McDonald and Dimmick (2003)
concluded  that  the  measures  Simpson’s  D  and  Shannon’s  H  are  most 
appropriate  if  one  is  interested  in  a  measure  that  is  simultaneously 
sensitive  to  the  number  of  categories  and  the  maximum  proportion  of 
categories.  A  recent  review  by  Budescu  and  Budescu  (2012)  in  the 
educational  field  that  explored  two  diversity  measures  (Simpson’s  D 
and  Shannon’s  H)  found  that  ranking  schools  according  to  these  two 
indices did not lead to differences in ranking. However, Shannon’s H
was slightly more strongly related to school-level achievement than
Simpson’s D. Overall, there is a dearth of research examining the
applicability of different diversity measures in the educational field and how
they  are  related  to  student  outcomes.  Therefore,  the  present  study 
starts  by  describing  a  variety  of  different  diversity  measures  within  the 
preliminary  analyses  section  (see  “Comparison  and  selection  of  di¬ 
versity  measures”)  before  including  these  measures  into  the  analysis 
models. 
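
The differing sensitivity of such indices can be illustrated with a small,
hypothetical Python example. It assumes D = 1 - sum(p_i^2) for Simpson’s D
and H = -sum(p_i * ln p_i) for Shannon’s H, and compares a classroom with
two equally sized groups to one in which a small third group appears. The
group shares are invented for illustration and are not study data.

```python
import math


def simpson_d(props):
    """Simpson's D, assumed here as 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in props)


def shannon_h(props):
    """Shannon's H, assumed here as -sum(p_i * ln(p_i)); zero shares are skipped."""
    return -sum(p * math.log(p) for p in props if p > 0)


def relative_change(before, after):
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Hypothetical group shares (not study data).
two_groups = [0.50, 0.50]
rare_third = [0.50, 0.45, 0.05]  # a small third group appears

change_d = relative_change(simpson_d(two_groups), simpson_d(rare_third))
change_h = relative_change(shannon_h(two_groups), shannon_h(rare_third))

print(f"Simpson's D changes by {change_d:.1%}")
print(f"Shannon's H changes by {change_h:.1%}")
```

Under these assumptions, H responds more strongly than D to the added small
group, while both indices still order the two classrooms in the same way.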

Proportion  of  Ethnic  Minority  Students  and  Individual 
Student  Outcomes 

Theories  and  empirical  research  on  the  proportion  of  ethnic 
minority  students  typically  address  effects  on  student  achievement 
and assume negative relationships between these two characteristics.
Research does not suggest that the proportion of minority
students per se is related to student outcomes, but rather that there are
mediating processes and associated aspects that induce these
relationships. Interrelated factors that can explain why the proportion
of  ethnic  minority  students — independent  of  the  prior  achievement 
and  socioeconomic  composition  as  well  as  individual  background 
characteristics — may  be  negatively  related  to  student  achievement 
outcomes  are  as  follows:  (a)  school  resources,  (b)  instructional 
quality,  (c)  minority  language  usage,  and  (d)  learning  culture. 

First,  ethnic  minority  students  often  have  less  access  to  schools 
with  good  resources  and  favorable  organizational  and  structural 
features  such  as  class  size,  teacher  qualifications,  and  programs 
that  encourage  learning.  They  are  more  likely  to  attend  residential 
neighborhood  schools  with  poor  resources  in  the  segregated  areas 
they  live  in  (Raudenbush  et  al.,  1998). 

Second,  classes  with  high  proportions  of  ethnic  minority  stu¬ 
dents  may  encounter  less  beneficial  learning  opportunities  in  terms 
of  instructional  quality;  for  example,  less  challenging  tasks  and  a 
less student-oriented climate. This is based on the assumption that
teachers show lower achievement expectations toward ethnic minority
students (Ready & Wright, 2011), which, in turn, may cause
them to offer fewer challenging learning opportunities and to
engage in less positive interactions with these students (for a
meta-analysis and review, see Den Brok & Levy, 2005; Tenenbaum
& Ruck, 2007). In addition, teachers in segregated neighborhood
schools with poor resources that ethnic minority students
frequent  tend  to  be  less  qualified  to  deliver  high-quality  instruc¬ 
tion.  These  inequalities  arise,  for  example,  because  the  most  qual¬ 
ified  teachers  gradually  shift  to  less-disadvantaged  schools  within 
an  area  because  they  typically  have  first  right  of  transfer  when 
vacancies  appear  (Betts,  Rueben,  &  Danenberg,  2000). 

Third,  classrooms  with  high  proportions  of  ethnic  minority 
students  also  tend  to  have  high  proportions  of  students  who  do  not 
speak  the  language  of  instruction  at  home.  As  a  consequence,  these 
students  may  be  less  able  to  support  each  other  by  explaining 
learning  materials  in  the  language  of  instruction.  Furthermore,  they 
may  not  speak  the  language  of  instruction  with  each  other  in 
situations  such  as  school  breaks,  which  results  in  fewer  learning 
opportunities  and  may  negatively  affect  students’  language-related 
achievement  (see  Driessen,  2002;  Entwisle  &  Alexander,  1994; 
Peetsma  et  al.,  2006;  Van  Ewijk  &  Sleegers,  2010a). 

Fourth,  ethnic  minority  students  may  share  values,  beliefs,  and 
behaviors  associated  less  with  learning  and  achievement  (e.g.,  nega¬ 
tive  attitudes  toward  school,  pessimism,  and  irregular  school  atten¬ 
dance).  Originally  focusing  on  motivation  of  Afro-American  students 
in  the  United  States,  Ogbu’s  (1987)  cultural  ecological  theory  as¬ 
sumes  that  minority  students  are  assigned  a  subordinate  status,  deval¬ 
uated  in  school  and  feel  vulnerable.  As  a  consequence,  these  students 
may  come  to  feel  alienated  from  school,  to  reject  educational  values, 
and  to  be  less  motivated  to  learn  (see  Kumar  &  Maehr,  2010;  Ogbu, 
2004).  In  classes  with  a  large  number  of  such  students,  peers  transmit 
these  values  and  beliefs  through  interacting  with  each  other.  Thus,  a 
less  beneficial  learning  culture  negatively  affecting  motivation  to 
learn  and  achievement  may  emerge  (see  Agirdag,  Van  Houtte,  &  Van 
Avermaet,  2012;  Goldsmith,  2011). 

The  aforementioned  factors  would  predict  a  negative  relation¬ 
ship  between  the  proportion  of  minority  students  and  student 
achievement,  but  a  few  theories  also  assume  a  positive  relationship 
with  some  psychosocial  outcomes,  such  as  feelings  of  belonging 
with  one’s  classmates  and  learning  motivation.  According  to  the 
self-determination  theory,  social  relatedness  or  belongingness  is 
one  of  the  three  basic  needs  whose  fulfillment  is  assumed  to  foster 
motivation to learn (Deci & Ryan, 2000; Niemiec & Ryan, 2009).
Motivation  to  learn  and  feeling  of  belonging  with  one’s  classmates 
are  relevant  educational  goals  of  schooling  in  addition  to  achieve¬ 
ment  development  because  they  are  associated  with  a  number  of 
favorable  outcomes,  such  as  engagement  in  learning  activities, 
higher  school  achievement  levels,  keeping  track  in  school,  life 
satisfaction, and mental health. Especially for minority students,
minority versus majority group membership may act as a lens
through which individuals in a culturally pluralistic society view
each other and on which they build their sense of belonging (see
belongingness perspective by Baumeister & Leary, 1995; Johnson,
Crosnoe, & Elder, 2001; Kumar & Maehr, 2010). This assumption
mostly refers to students with the same specific ethnic background
as a source of belongingness, yet it may also apply to majority
versus minority group membership in society. According to social
identity theory (Tajfel & Turner, 1986) and the similarity-attraction
paradigm (Byrne, 1971), group membership and feeling
of belonging are based on similarity between students. Minority
students may view themselves to be more similar to each other
than to majority students, for instance, in terms of multilingual
experiences and immigration history within the family. In their
review, Kumar and Maehr (2010) argue that minority students
often feel rejected by peers from the majority group in society and
therefore show less motivation to learn. By implication, a classroom
with a high proportion of minority students should
strengthen minority students’ feeling of belonging with their peers
and facilitate their motivation to learn. This could result in a
u-shaped relationship between the proportion of minority students
and students’ feeling of belonging at the classroom level. In
addition, a heightened feeling of belonging with one’s classmates
could positively affect future student achievement (Christenson,
Reschly, & Wylie, 2012). Some authors also assume a linear
positive relationship between the proportion of minority students
and motivational characteristics. These assumptions include, for
instance, that classrooms with low proportions of minority students
may be more competitive and focus more on performance goals
that are less favorable for achievement development. They also
focus on specific immigrant groups, such as Asian American
students in the United States, who may share especially heightened
learning motivation (cf. Zusho, Pintrich, & Cortina, 2005).

1 The present study does not include threefold concepts of diversity
measures because we do not focus on numerical distance measures
expressing similarity or dissimilarity between countries of birth or
ethnicities and because there are no appropriate distance measures
available.

2 The present study covers diversity as a characteristic at the classroom
level and does not include operationalizations such as the proportion of
students with the same background as an individual student at the student
level (see Benner & Crosnoe, 2011; Hoppe et al., 2014).

International  meta-analyses  (Mickelson  et  al.,  2013;  Van  Ewijk  & 
Sleegers,  2010a)  commonly  find  a  substantial  but  small  negative 
effect  of  the  proportion  of  minority  students  in  predicting  student 
achievement.  This  effect  varies  in  size  depending  on  the  minority 
groups  explored,  the  students’  age,  the  control  variables  included,  and 
the  operationalization  of  the  constructs.  Although  more  research  on 
mediating  processes  is  clearly  needed,  some  findings  support  the 
hypotheses  of  ethnic  inequalities  in  access  to  schools  with  high 
resources  (Raudenbush  et  al.,  1998),  of  lower  instructional  quality  in 
classrooms  with  high  proportions  of  ethnic  minority  students 
(Palardy,  2015;  Stipek,  2004),  and  a  less  favorable  learning  culture  in 
schools  with  high  proportions  of  ethnic  minority  students  and  average 
low  SES  (Agirdag  et  al.,  2012;  Goldsmith,  2011).  In  addition,  a  few 
studies  suggest  that  minority  students  look  forward  to  instruction 
more  and  believe  that  the  topics  they  learn  will  be  more  useful  in  the 
future if they attend schools with high proportions of minority
students (Goldsmith, 2004). Furthermore, language minority students
were shown to be more motivated to learn in language lessons in
classrooms with higher proportions of language minority students (Rjosk,
Richter, Hochweber, Lüdtke, & Stanat, 2015).

Ethnic  Diversity  and  Individual  Student  Outcomes 

According  to  Piaget’s  (1977)  concept  of  disequilibrium,  ethnic 
diversity should have positive effects on students’ cognitive
development. Being faced with new information that does not fit into one’s
schemas (for instance, through exposure to multiple perspectives
from people with varying ethnic backgrounds) induces a state of
unpleasantness. This state drives the learning process through
assimilation and accommodation of new ideas and thus fosters cognitive
development.  As  a  consequence,  several  authors  assume  positive 
effects  of  ethnic  diversity  in  the  classroom  on  students’  achievement 
(Benner  &  Crosnoe,  2011;  Gurin  et  al.,  2003;  Tam  &  Bassett, 
2004). This line of argument is similar to the information/decision-making
perspective taken in organizational psychology to explain
positive  effects  of  work  team  diversity  (see  Meyer,  in  press). 
Studies  in  educational  research  found,  for  instance,  that  students  in 
ethnically  more  heterogeneous  kindergartens  showed  higher 
achievement  levels  in  mathematics  and  reading,  after  controlling 
for  the  socioeconomic  composition,  proportion  of  minority  stu¬ 
dents,  and  other  school  characteristics  (Benner  &  Crosnoe,  2011). 
Students from ethnically diverse high schools also showed university
grade point averages (GPAs) in the first semester that were
one-fourth to one-half point higher than those of students from a
nondiverse high school, after controlling for achievement composition
and quality of high schools (Tam & Bassett, 2004).

Although  positive  relationships  are  assumed  between  ethnic 
diversity  on  the  one  hand  and  cognitive  development  as  well  as 
school  achievement  on  the  other,  a  negative  association  between 
diversity  and  students’  feeling  of  belonging  with  their  classmates 
may  exist  (Benner  &  Crosnoe,  2011;  Benner,  Graham,  &  Mistry, 
2008).  According  to  the  belongingness  perspective  mentioned 
previously  (see  Baumeister  &  Leary,  1995;  Byrne,  1971;  Tajfel  & 
Turner,  1986),  ethnic  diversity  should  be  negatively  associated 
with  attachment  to  the  peers.  These  theories  assume  that  students’ 
sense  of  belonging  is  more  strongly  related  to  the  specific  ethnicity 
of  the  peers  than  to  the  broader  category  of  minority  status.  For 
instance, in a study on U.S. ninth-graders, students perceived the
school  climate  to  be  fairer  and  more  directed  toward  academics 
and  interracial  understanding  if  they  attended  ethnically  less  di¬ 
verse  schools  (Benner  et  al.,  2008).  Furthermore,  ethnically  less 
diverse  schools  and  classrooms  were  characterized  by  stronger 
attachment  to  school  (Johnson  et  al.,  2001)  and  lower  levels  of 
perceived  cultural  discrimination  (Seaton  &  Yip,  2009). 

In  sum,  educational  research  commonly  assumes  a  negative 
relationship  between  the  proportion  of  ethnic  minority  students 
and  student  achievement.  Empirical  findings  indicate  that  the 
relationships  may  be  as  predicted,  yet  there  is  not  much  research 
on  the  underlying  mechanisms  inducing  compositional  effects. 
However,  some  studies  also  find  a  positive  relationship  between 
ethnic  classroom  diversity  and  students’  achievement.  For  stu¬ 
dents’  feeling  of  belonging  with  their  classmates  as  a  psychosocial 
outcome,  theories  predict  a  negative  relationship  between  hetero¬ 
geneity  and  feeling  of  belonging  and  a  u-shaped  relationship 
between  the  proportion  of  minority  students  and  feeling  of  belong¬ 
ing.  While  the  former  assumption  is  supported  by  some  studies,  the 
latter  has  not  been  investigated  to  our  knowledge. 

To  conclude,  the  two  strands  of  theories  and  findings  pre¬ 
sented  so  far  would  lead  to  contradictory  predictions:  A  high 
proportion  of  ethnic  minority  students  in  a  classroom  has  been 
shown  to  be  negatively  related  to  student  achievement.  Simul¬ 
taneously,  a  high  proportion  of  ethnic  minority  students  partly 
corresponds to a higher ethnic diversity, which is assumed to be
positively related to school achievement. This raises the question
of how the ethnic composition is related to school achievement
when one tries to disentangle the effects of the proportion
of  ethnic  minority  students  and  ethnic  diversity.  Furthermore, 
we  are  interested  in  the  question  of  how  ethnic  diversity  and  the 
broader  distinction  between  ethnic  majority  and  ethnic  minority 
students  are  related  to  students’  feeling  of  belonging  with  their 
classmates  as  an  example  of  a  non-cognitive  student  outcome. 
Because  theories  suggest  that  the  feeling  of  belonging  to  one’s 
classmates  should  be  positively  related  to  student  achievement 
(Christenson  et  al.,  2012),  and  could  possibly  mediate  the 
relationship  between  classroom  composition  and  achievement 
outcomes,  we  will  further  explore  these  relationships  in  a  full 
model  including  various  student  outcomes. 

The  Present  Study:  Research  Questions 
and  Hypotheses 

The  aim  of  the  present  study  is  to  examine  the  relationship 
between  the  ethnic  makeup  of  classrooms  and  student  achievement 
in  mathematics  and  reading  comprehension  and  students’  feeling 
of  belonging  with  their  classmates.  The  competing  theoretical 
assumptions  and  empirical  findings  presented  in  the  last  sections 
form  the  basis  of  our  study.  We  examine  the  relationships  between 
student  outcomes  and  the  proportion  of  minority  students,  as  well 
as  ethnic  diversity  in  German  elementary  school  classrooms  in  a 
cross-sectional  design.  That  is,  our  analyses  provide  information 
on  associations  between  classroom  characteristics  and  student  out¬ 
comes,  but  do  not  allow  drawing  conclusions  about  causal  effects. 
The  average  SES  in  a  classroom  and  the  average  prior  achievement 
are  classroom-level  background  variables  that  have  been  shown  to 
matter  for  individual  achievement  outcomes  in  former  studies 
(e.g., Van Ewijk & Sleegers, 2010b). Our analyses include these
variables  as  covariates.  We  take  the  average  prior  achievement  into 
account  using  the  average  cognitive  abilities  in  the  classroom  as  a 
proxy.  Our  research  questions  and  hypotheses  are  as  follows: 

(1) Relationship with achievement scores. (1a) Is the proportion
of ethnic minority students in a classroom related to
students’ achievement in mathematics and reading comprehension?
We predict that the proportion of minority students will
be  negatively  related  to  achievement  outcomes.  The  underlying 
assumption  is  that  in  classrooms  with  high  proportions  of  minority 
students,  there  is  a  less  favorable  learning  environment  character¬ 
ized  by  poor  school  resources,  lower  instructional  quality,  non- 
German  language  usage  with  peers,  and  less  favorable  learning 
culture  (see  “Proportion  of  ethnic  minority  students  and  individual 
student  outcomes”). 

(1b) Is ethnic diversity in the classroom (operationalized by
various diversity measures) related to students’ achievement
in mathematics and reading comprehension? Contrary to some
studies reviewed in the last sections, we predict a negative
relationship between ethnic diversity measures and achievement for
the same reasons as those described in Hypothesis 1a.

(1c) Do diversity measures explain additional variance in
student achievement over and above the proportion of ethnic
minority students? That is, we ask whether the proportion of
minority students is sufficient to investigate associations with the
ethnic makeup in terms of diversity or whether additional measures
provide further information. We predict that measures of
ethnic diversity provide additional information on the classroom
composition and therefore will be related to student achievement
outcomes over and above the proportion of ethnic minority students
in a classroom. We furthermore assume that controlling for
the  proportion  of  ethnic  minority  students  in  a  classroom  and  other 
background  characteristics  also  controls  for  characteristics  of  the 
learning  environment  associated  with  classroom  composition. 
Holding  that  constant  might  offer  the  opportunity  to  investigate 
whether  diversity  is  positively  associated  with  achievement  as 
predicted  in  the  literature.  We  predict  additional  positive  relation¬ 
ships  between  achievement  scores  and  ethnic  diversity. 
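
Hypotheses 1a to 1c can be summarized in a schematic two-level regression.
The following is only a sketch in illustrative notation; the symbols and
the covariates shown are assumptions for exposition, not the exact
specification estimated in the study. Here y_ij is the achievement of
student i in classroom j, M_ij indicates individual minority status, P_j is
the classroom proportion of minority students, D_j is a diversity index,
and the barred terms are the classroom averages of SES and cognitive
abilities.

```latex
\begin{align}
y_{ij} &= \beta_{0j} + \beta_{1} M_{ij} + e_{ij}, \\
\beta_{0j} &= \gamma_{00} + \gamma_{01} P_j + \gamma_{02} D_j
            + \gamma_{03}\,\overline{\mathrm{SES}}_j
            + \gamma_{04}\,\overline{\mathrm{COG}}_j + u_j .
\end{align}
```

Read this way, Hypothesis 1a would correspond to \gamma_{01} < 0 and
Hypothesis 1b to \gamma_{02} < 0 when each predictor is entered without the
other, and Hypothesis 1c to \gamma_{02} > 0 when both are entered together,
so that P_j is held constant.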

(2)  Relationship  with  feeling  of  belonging  with  one’s  peers. 
(2a)  Is  the  proportion  of  ethnic  minority  students  related  to 
students’  feeling  of  belonging  with  their  classmates?  We  as¬ 
sume  that  the  proportion  of  ethnic  minority  students  reveals  dif¬ 
ferent  relationships  with  the  feeling  of  belonging  for  minority 
students  than  for  majority  students.  According  to  the  belonging¬ 
ness  hypothesis,  minority  students  should  feel  more  attached  in 
classrooms  with  high  proportions  of  minority  students,  and  major¬ 
ity  students  should  feel  more  attached  in  classrooms  with  high 
proportions  of  majority  students.  That  is,  we  predict  an  interaction 
between  the  proportion  of  minority  students  and  individual  minor¬ 
ity  status.  In  line  with  this  assumption,  the  proportion  of  minority 
students  should  be  on  average  related  to  the  feeling  of  belonging 
with  one’s  peers  in  a  u-shaped  manner. 
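
The step from the predicted interaction to a u-shaped average relationship
can be made explicit with a short derivation. It is a sketch under
simplifying assumptions that are not part of the study: belonging B_ij is
linear in the proportion of minority students P within each group,
M_ij = 1 marks minority students, and the coefficients b_0 to b_3 are
illustrative symbols.

```latex
\begin{align}
B_{ij} &= b_0 + b_1 M_{ij} + b_2 P_j + b_3 M_{ij} P_j + e_{ij},
  \qquad b_2 < 0,\; b_2 + b_3 > 0, \\
\bar{B}(P) &= (1 - P)\,(b_0 + b_2 P)
            + P\,\bigl(b_0 + b_1 + (b_2 + b_3) P\bigr) \\
           &= b_0 + (b_1 + b_2)\,P + b_3 P^2 .
\end{align}
```

Because b_2 < 0 and b_2 + b_3 > 0 together imply b_3 > 0, the classroom
average is convex in P, with its minimum at P* = -(b_1 + b_2) / (2 b_3).
The average relationship is u-shaped within the possible range of
proportions only if 0 < P* < 1.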

(2b) Is ethnic diversity in the classroom (operationalized by
various diversity measures) related to students’ feeling of
belonging with their classmates? We assume that ethnic diversity is
negatively  related  to  the  feeling  of  belonging  because  a  large 
diversity  in  a  classroom  corresponds  to  a  low  number  of  students 
from  the  same  ethnic  background  (see  “Ethnic  diversity  and  indi¬ 
vidual  student  outcomes”). 

(2c) Do diversity measures explain additional variance in
students’ feeling of belonging with the peers over and above
the proportion of ethnic minority students? That is, we ask
whether the proportion of minority students is sufficient to
investigate associations with the ethnic makeup in terms of diversity or
whether additional measures provide further information. We
assume that ethnic diversity explains additional variance in students’
feeling  of  belonging  with  the  peers  over  and  above  the  proportion 
of  ethnic  minority  students  as  students  should  build  their  sense  of 
belonging  on  the  specific  ethnicity  of  the  peers  in  their  classroom 
rather  than  on  the  broader  category  of  minority  status  (see  “Ethnic 
diversity  and  individual  student  outcomes”). 

As  a  last  step,  we  explore  the  relationship  between  the  ethnic 
makeup  of  classrooms,  achievement  measures,  and  students’  feel¬ 
ing  of  belonging  with  their  classmates  in  a  full  model  (i.e.,  includ¬ 
ing  both  ethnic  makeup  and  feeling  of  belonging  as  predictors  of 
achievement)  to  gain  first  insights  into  possibly  mediating  effects 
of students’ feeling of belonging.

Method

Participants 

Our  analyses  are  based  on  data  from  a  nationally  representative 
sample  of  elementary  school  students  in  Germany  who  participated  in 
the 2011 National Assessment Study of student achievement in elementary
schools (IQB-Ländervergleich; Stanat, Pant, Böhme, &
Richter, 2012) of the German Institute for Educational Quality Improvement
(IQB). The data include 27,081 students of complete
fourth-grade classrooms in 1,349 randomly selected German public
schools (see Richter et al., 2012).

We  excluded  special  needs  schools,  schools  from  the  former  Ger¬ 
man  Democratic  Republic  (GDR)— because  they  have  very  low 
proportions  of  ethnic  minority  students  (see  Federal  Statistical  Office 
Germany,  2012) — and  classrooms  with  a  large  number  of  missing 
values  for  ethnic  background  information  (see  “Missing  data  treat¬ 
ment”).  As  a  consequence,  our  analyses  were  based  on  18,762  students 
attending  903  classrooms  in  903  schools  (average  number  of  students  per 
classroom,  M  =  21).  For  a  sample  description,  see  Table  1. 

Measures 

We  use  information  from  standardized  achievement  tests,  stu¬ 
dent  questionnaires,  and  parent  questionnaires,  which  were  com¬ 
pleted  anonymously. 

Student-level  independent  variables. 

Ethnic  background  of  students.  We  categorized  the  ethnic 
background  of  students  using  information  from  the  parent  question¬ 
naire.  If  the  parent  response  was  missing,  we  used  information  from 
the  student  questionnaire  (see  “Missing  data  treatment”).  The  ethnic 
background  was  categorized  based  on  the  country  where  the  parents 
were born. To be categorized as a student with minority status, at least
one parent had to be born abroad. For instance, a student with a
Turkish background has parents who were both born in Turkey or one
parent in Turkey and one parent in Germany. If one parent was born
in  Turkey  and  one  parent  in  another  country  (i.e.,  not  Germany),  the 
student  was  assigned  to  the  category  “other  country.”  Within  the 
category  “other  country,”  countries  most  represented  were  Iran  and 
Arab  countries.  In  our  analyses,  we  distinguish  six  groups  (Table  1) 
corresponding  to  the  largest  groups  in  this  sample  (see  Stanat  et  al., 
2012). If questionnaire information was completely missing, we did
not exclude the student, because doing so would distort the classroom-level
analyses, but assigned him or her to the category “unidentifiable,”
which we included in the analyses (Table 1). For analyses broadly comparing
ethnic majority and ethnic minority students, students with a
German background (i.e., both parents born in Germany) were categorized
as “ethnic majority” and the remaining groups as “ethnic minority.”
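
The categorization rules above can be sketched as a small function. This is a minimal illustration, not the study's actual coding script: the function name, the group labels, and the handling of a student with only one known parental birthplace are assumptions, and the mapping of concrete countries to the group labels is left to the caller.

```python
from typing import Optional

# Hypothetical labels for the named origin groups; any other country of
# birth outside Germany falls into the category "other".
NAMED_GROUPS = {"Turkey", "former USSR", "Poland", "former Yugoslavia"}
GERMANY = "Germany"


def classify_background(parent1: Optional[str], parent2: Optional[str]) -> str:
    """Return the ethnic-background category for one student.

    Each argument is the (already grouped) birthplace of one parent,
    or None when the information is missing.
    """
    known = [p for p in (parent1, parent2) if p is not None]
    if not known:
        # Completely missing information: keep the student, flag the category.
        return "unidentifiable"
    # Assumption: with one known parent, classify on that parent alone.
    foreign = {p for p in known if p != GERMANY}
    if not foreign:
        return "German"
    if len(foreign) == 1:
        origin = next(iter(foreign))
        # Both parents from the same origin, or one from it and one from Germany.
        return origin if origin in NAMED_GROUPS else "other"
    # Parents born in two different countries outside Germany.
    return "other"


print(classify_background("Turkey", "Germany"))
print(classify_background("Turkey", "Poland"))
```

Under these rules a student with one parent born in Turkey and one in Germany is assigned to the Turkish group, whereas a student with parents born in Turkey and Poland is assigned to "other", matching the example in the text.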


Students’  SES.  As  the  measure  of  SES  we  used  the  highest 
International  Socio-Economic  Index  of  Occupational  Status  (HISEI; 
Ganzeboom, 2010; Ganzeboom, De Graaf, Treiman, & De Leeuw,
1992).  This  index  is  a  classification  of  parents’  occupation  based 
on  income  and  education,  with  a  score  range  of  10  (e.g.,  a  kitchen 
helper)  to  89  (e.g.,  a  medical  doctor).  We  used  information  about 
the  current  occupation  that  the  parents  provided  in  the  question¬ 
naire  in  an  open  answer  format. 

Cognitive  abilities.  To  approximate  students’  prior  achieve¬ 
ment  within  the  cross-sectional  design,  we  used  the  figural  subtest 
of  the  cognitive  abilities  test  (KFT  4-12  +  R;  Heller  &  Perleth, 
2000;  see  Baumert,  Stanat,  &  Watermann,  2006,  on  the  validity  of 
such  a  proxy).  This  standardized  test  is  a  commonly  used  cognitive 
abilities  test  in  Germany  and  comparable  to  the  cognitive  abilities 
test  (CAT)  by  Thorndike  and  Hagen  (1971,  1993).  We  use  the 
figural  analogies  subscale  consisting  of  25  items  in  which  students 
are  asked  to  choose  one  figure  out  of  five  possibilities  in  analogy 
to a given pair of figures. The test authors describe the reliability
and validity of this measure as very satisfactory (internal consistency
of figural subscale: α = .92; retest reliability of the whole
test consisting of figural, verbal, and numerical parts after 2 years:
$r_{tt}$ = .83; average predictive validity of the whole test: correlation
of r = .41 with GPA of higher education entrance certification up
to 8 years later). Internal consistency of the subscale in the analysis
sample was α = .93. Correlations with achievement outcome
measures in the analysis sample were $r_{\mathrm{mathematics}}$ = .54 and
$r_{\mathrm{reading}}$ = .40.

Gender.  Student  gender  was  recorded  in  the  tracking  form 
completed  by  the  classroom  teacher  (dummy  coding;  male  =  0, 
female  =  1). 

Classroom  level  independent  variables.  The  key  classroom 
level  variables  of  this  study  were  the  proportion  of  ethnic  minority 
students  in  a  classroom  and  ethnic  diversity.  Covariates  were  the 
classroom  level  SES  and  cognitive  abilities. 

Proportion of ethnic minority students. We calculated the
proportion of ethnic minority students in each classroom in accordance
with the individual student level categorizations described
previously as relative frequency (see the next paragraph “Ethnic
diversity” for an example).

Table 1

Descriptive Sample Statistics for Demographic Variables

Variable                                          M (SD)           Minimum   Maximum

Individual level (L1), N = 18,762
  Ethnic minority status                          37.60%           -         -
  SES                                             50.23 (16.19)    10        89
  Female                                          49.50%           -         -
  Age in years                                    10.41 (.50)      6.83      13.17

Classroom level (L2), N = 903
  Proportion of German background                 60.52% (22.08)   0%        100%
  Proportion of Turkish background                6.53% (9.50)     0%        61.11%
  Proportion of former USSR background            4.91% (7.63)     0%        55.56%
  Proportion of Polish background                 2.13% (4.00)     0%        41.67%
  Proportion of former Yugoslavia background      2.96% (4.61)     0%        29.41%
  Proportion of other background                  9.71% (9.09)     0%        52.94%
  Proportion of students with missing
    background information (“unidentifiable”)     13.24% (10.67)   0%        48.15%
  Proportion of ethnic minority students          39.48% (22.08)   0%        100%
  SES                                             49.64 (8.15)     25.75     78.07

Note. SES = socioeconomic status. For operationalization of SES and students’ ethnic background, see
Measures section.

Ethnic diversity. We operationalized ethnic classroom diversity by
calculating various diversity measures (Table S.1) based on individual
ethnic  background  information  described  previously.  In  the  following, 
we  explain  how  four  exemplary  measures  to  operationalize  the  ethnic 
makeup  of  classrooms  are  computed:  the  proportion  of  ethnic  minor¬ 
ity  students  and  three  diversity  measures — number  of  categories, 
Simpson’s  D,  and  Shannon’s  H. 

Imagine two fictitious classrooms, Classroom A and Classroom B.
Both classrooms have 20 students. In Classroom A, there are 6
students with German background, 13 with Turkish background, and
1 with Polish background. In Classroom B, there are 6 students with
German background, 5 with Turkish background, 4 with parents from
the former USSR, 4 with parents from the former Yugoslavia, and 1
with another ethnic background. The proportion of ethnic minority
students in Classroom A is 0.7 ($\mathit{prop}_A = (13 + 1)/20 = 0.7$), and in
Classroom B it is also 0.7 ($\mathit{prop}_B = (5 + 4 + 4 + 1)/20 = 0.7$). The
number of ethnic groups in Classroom A equals 3 ($N_{\mathit{cat},A} = 3$: German,
Turkish, Polish), and in Classroom B it equals 5 ($N_{\mathit{cat},B} = 5$:
German, Turkish, former USSR, former Yugoslavia, other). We
now use this scenario to illustrate further diversity measures.

Simpson’s D represents the probability that two students
selected at random from a classroom belong to different ethnicities;
thus, the greater the value of D, the greater the diversity.
Its minimum value is 0, and its maximum is achieved when
the distribution across the c ethnic groups in the classroom is
uniform (in our study, c = 7, i.e., $D_{\max} = (7 - 1)/7 = 0.857$).
Shannon’s H involves a logarithmic transformation of probabilities
($H_{\min} = 0$; $H_{\max}$ in our study $= \ln(c) = \ln(7) = 1.946$).
Simpson’s D and Shannon’s H both use the relative frequencies
($p_i$) of each ethnic group in the classroom. For Classroom A, the
relative frequencies are $p_{A,\mathrm{German}} = 0.3$, $p_{A,\mathrm{Turkish}} = 0.65$, and
$p_{A,\mathrm{Polish}} = 0.05$, and for Classroom B they are $p_{B,\mathrm{German}} = 0.3$,
$p_{B,\mathrm{Turkish}} = 0.25$, $p_{B,\mathrm{USSR}} = 0.2$, $p_{B,\mathrm{Yugoslavia}} = 0.2$, and
$p_{B,\mathrm{other}} = 0.05$. The measure Simpson’s D for Classroom A
equals $D_A = 0.485$,

$$D_A = 1 - \sum_i p_i^2 = 1 - \bigl((0.3 \times 0.3) + (0.65 \times 0.65) + (0.05 \times 0.05)\bigr) = 1 - 0.515 = 0.485,$$

and for Classroom B $D_B = 0.765$,

$$D_B = 1 - \sum_i p_i^2 = 1 - \bigl((0.3 \times 0.3) + (0.25 \times 0.25) + (0.2 \times 0.2) + (0.2 \times 0.2) + (0.05 \times 0.05)\bigr) = 1 - 0.235 = 0.765.$$

The measure Shannon’s H for Classroom A equals $H_A = 0.791$,

$$H_A = -\sum_i p_i \ln(p_i) = -\bigl((0.3 \times \ln(0.3)) + (0.65 \times \ln(0.65)) + (0.05 \times \ln(0.05))\bigr) = 0.791,$$

and for Classroom B it equals $H_B = 1.501$,

$$H_B = -\sum_i p_i \ln(p_i) = -\bigl((0.3 \times \ln(0.3)) + (0.25 \times \ln(0.25)) + (0.2 \times \ln(0.2)) + (0.2 \times \ln(0.2)) + (0.05 \times \ln(0.05))\bigr) = 1.501.$$

In  sum,  our  example  shows  that,  even  though  the  proportion  of 
minority  students  in  Classroom  A  and  B  are  the  same,  their  student 
body  differs  in  terms  of  ethnic  homogeneity.  Classroom  B  is 
ethnically  more  diverse  than  Classroom  A,  which  is  reflected  in  a 
larger  number  of  ethnicities  in  the  classroom  and  higher  values  of 
Simpson’s  D  and  Shannon’s  H  (Table  2). 
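
The four exemplary measures can be recomputed for the two fictitious classrooms with a few lines of code. The sketch below is only an illustration of the formulas given above; the function name `diversity_measures` and the string labels for the ethnic groups are assumptions introduced for this example.

```python
import math
from collections import Counter


def diversity_measures(backgrounds, majority="German"):
    """Compute the four exemplary measures for one classroom.

    `backgrounds` holds one ethnic-background label per student.
    """
    n = len(backgrounds)
    counts = Counter(backgrounds)
    # Relative frequency p_i of each ethnic group in the classroom.
    p = [count / n for count in counts.values()]
    return {
        "prop_minority": (n - counts.get(majority, 0)) / n,
        "n_cat": len(counts),
        "simpson_d": 1 - sum(p_i ** 2 for p_i in p),
        "shannon_h": -sum(p_i * math.log(p_i) for p_i in p),
    }


classroom_a = ["German"] * 6 + ["Turkish"] * 13 + ["Polish"] * 1
classroom_b = (["German"] * 6 + ["Turkish"] * 5 + ["former USSR"] * 4
               + ["former Yugoslavia"] * 4 + ["other"] * 1)

measures_a = diversity_measures(classroom_a)
measures_b = diversity_measures(classroom_b)
for name, measures in (("A", measures_a), ("B", measures_b)):
    print(name, {key: round(value, 3) for key, value in measures.items()})
```

Rounded to three decimals, the printed values correspond to the hand computations in the text: both classrooms share a minority proportion of 0.7 but differ in the number of groups, Simpson’s D, and Shannon’s H.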

Classroom  level  socioeconomic  status.  We  aggregated  the 
average  SES  score  for  each  classroom  based  on  the  individual 
student  scores  (see  “Student-level  independent  variables”). 

Classroom level cognitive abilities. We aggregated the average
cognitive abilities test score for each classroom as a proxy for
prior achievement based on the individual student scores (see
“Student-level independent variables”).

Table 2

Ethnic Makeup of Two Different Classrooms: Exemplary Computations

Variable                           Classroom A   Classroom B

Proportion of minority students    .70           .70
Number of ethnic groups            3             5
Simpson’s D                        .485          .765
Shannon’s H                        .791          1.501

Note. Both classrooms involve 20 students but differ in their distribution of
ethnic makeup (Classroom A: 6, 13, and 1; Classroom B: 6, 5, 4, 4, and 1).

Outcome  variables. 

Student  achievement  in  reading  comprehension  and 
mathematics.  Trained  test  administrators  conducted  standard¬ 
ized  achievement  tests  in  German  reading  comprehension  and 
mathematics  in  the  classrooms.  The  tests  were  designed  by  a  team 
of  experienced  teachers  and  scientists  in  partnership  with  the 
German  Institute  for  Educational  Quality  Improvement  (IQB)  to 
measure  these  achievement  domains  in  accordance  with  the  Ger¬ 
man  national  educational  standards.  These  educational  standards 
were  introduced  by  the  Standing  Conference  of  the  Ministers  of 
Education and Cultural Affairs of the Länder in the Federal
Republic  of  Germany  (Kultusministerkonferenz  [KMK]).  They 
describe  competencies  students  are  expected  to  have  developed  at 
a  certain  grade.  The  Institute  for  Educational  Quality  Improvement 
is  responsible  for  coordinating  the  test  development  process,  work¬ 
ing  with  teachers  and  experts  in  subject-matter  education,  and 
evaluating  the  psychometric  item  properties  based  on  pilot  studies 
with  nationally  representative  data  sets. 

Task  units  measuring  reading  comprehension  consisted  of  a 
literary  or  factual  text  of  half  a  page  to  one  and  a  half  pages  and 
several  items  with  varying  complexity.  The  items  were  mainly 
presented  as  multiple-choice  and  short-answer  questions.3  Each 
student received two to four out of 11 task units (booklet design).

The  mathematics  achievement  test  covered  the  five  content 
domains  “numbers  and  operations,”  “space  and  shape,”  “patterns 
and  structures,”  “measurement,”  and  “probability”  using  a  variety 
of  tasks,  such  as  simple  computations,  extracting  information  from 
charts, and reflecting shapes (see Winkelmann, van den Heuvel-Panhuizen,
& Robitzsch, 2008).

A  generalized  Rasch  model  was  used  to  estimate  student 
achievement  scores  on  a  common  scale  for  each  achievement 
domain  of  reading  and  mathematics  achievement.  Mathematics 
and  reading  scores  were  generated  using  the  plausible  values  (PV) 
technique  (Adams,  Wu,  &  Carstensen,  2007).  Fifteen  PVs  were 
generated for each student for mathematics and reading achievement,
which were scaled to have a mean score of 500 and an SD
of  100  in  the  German  student  population  (current  sample  distribu¬ 
tion  mathematics:  M  =  489.05,  SD  =  96.00,  reading:  M  =  494.12, 
SD  =  90.87).  Expected  a  posteriori  reliabilities  of  plausible  values 
(EAP/PV  reliabilities;  cf.  Adams,  2005)  in  the  calibration  model 
were  .91  (mathematics)  and  .73  (reading). 


3 For illustrative examples of test items, see www.iqb.hu-berlin.de/laendervergleich/LV2011/Beispielaufgaben.

Feeling of belonging with one’s peers. Students rated their
feeling of belonging with their classmates in the student questionnaire.
The scale is part of a questionnaire measuring emotional and
social experiences in school (Rauer & Schuck, 2003). It consists of
four items rated on a 4-point Likert scale (1 = “fully disagree” to
4 = “fully agree”) by all students. Item examples are “My classmates
are nice to me” and “When I am sad, my classmates comfort
me”  (manifest  M  =  3.3,  SD  =  0.6,  Cronbach’s  alpha  =  .71).  We 
used  these  items  to  represent  the  latent  construct  of  belonging  with 
one’s  peers  in  a  multilevel  structural  equation  model  framework 
(see  “Data  analysis”  and  “Measurement  model  fit  and  variance  of 
outcome  variables  between  classrooms”). 

Data  Analysis 

Preliminary  analyses.  As  part  of  our  preliminary  analyses, 
we calculated eight diversity measures (Table S.1) and analyzed
their  relationship  with  each  other  using  Pearson  correlations  (see 
“Comparison  and  selection  of  diversity  measures”). 

Analyses  for  Research  Questions  1  and  2.  We  used  struc¬ 
tural  equation  modeling  to  explore  our  Research  Questions  1  and 
2  (see  Bovaird,  2007).  Several  random  intercept  multilevel  struc¬ 
tural  equation  models  with  fixed  slope  were  estimated  to  analyze 
the  relationship  between  (a)  the  proportion  of  ethnic  minority 
students  in  a  classroom  and  student  outcomes,  (b)  ethnic  diversity 
operationalized  by  various  measures  and  student  outcomes,  as  well 
as  (c)  both  classroom  characteristics  and  student  outcomes.  We 
employed  a  stepwise  model  building  procedure.  We  used  the 
software Mplus (Version 6.1; Muthén & Muthén, 1998-2010) for
all  analyses  with  the  option  “type  =  imputation”  pooling  the 
results  of  analyses  with  the  15  plausible  values  of  the  student 
achievement  measures. 
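
Pooling across plausible values follows the usual logic for multiply imputed data: the point estimates are averaged, and the standard error combines the average sampling variance with the variation of the estimates between analyses. The sketch below illustrates these combining rules (commonly attributed to Rubin) in plain Python; it is not the Mplus implementation, the function name is hypothetical, and the numbers are made up for demonstration and are not results from this study.

```python
import math
from statistics import mean, variance


def pool_plausible_values(estimates, standard_errors):
    """Combine one coefficient estimated separately for each plausible value."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Average sampling variance within the separate analyses.
    within = mean([se ** 2 for se in standard_errors])
    # Variance of the estimates between the separate analyses (divisor m - 1).
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Made-up coefficients and standard errors from three hypothetical analyses.
pooled, pooled_se = pool_plausible_values([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled, pooled_se)
```

With 15 plausible values, the same function would simply receive 15 estimates and 15 standard errors per coefficient.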

Metric  background  variables  at  the  student  level  (SES,  cognitive 
abilities)  were  standardized,  which  implies  centering  at  their  grand 
mean.  Categorical  variables  (ethnic  minority  status,  gender)  were 
neither  centered  nor  standardized.  The  estimates  of  background 
variables  aggregated  at  the  classroom  level,  that  is  proportion  of 
minority  students,  average  SES,  and  average  cognitive  abilities, 
can  be  interpreted  as  compositional  effects  (Raudenbush  &  Bryk, 
2002).  A  compositional  effect  reflects  the  effect  of  the  aggregate  of 
a  person-level  characteristic  (e.g.,  proportion  of  minority  students) 
even  after  controlling  for  the  effect  of  the  individual  characteristic 
(e.g.,  individual  minority  background;  see  Raudenbush  &  Bryk, 
2002).  Effects  of  diversity  indices  do  not  show  compositional 
effects  but  effects  at  the  classroom  level.  All  classroom  level 
variables  were  standardized  at  the  classroom  level.  We  addition¬ 
ally  included  the  quadratic  term  of  the  proportion  of  ethnic  mi¬ 
nority  students  in  a  classroom,  in  order  to  explore  a  potential 
nonlinear  relationship  between  the  proportion  and  student  out¬ 
comes  across  classrooms.  Prior  to  calculating  the  quadratic  term, 
we  centered  the  proportion  of  ethnic  minority  students  at  its  mean 
to  counteract  multicollinearity.  All  regression  coefficients  in  the 
result  tables  were  standardized  using  the  total  variance  (within  + 
between)  of  the  outcome  variable. 
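
The reason for centering before squaring can be seen in a small numerical illustration. The sketch below uses an artificial, evenly spaced set of proportions (an assumption made only for this example, not the distribution observed in the sample) and compares the correlation between a predictor and its square with and without centering.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Artificial proportions of minority students: 0.00, 0.01, ..., 1.00.
proportions = [i / 100 for i in range(101)]

# Quadratic term built from the raw proportions.
r_raw = pearson(proportions, [p ** 2 for p in proportions])

# Quadratic term built after centering at the mean.
center = mean(proportions)
centered = [p - center for p in proportions]
r_centered = pearson(centered, [c ** 2 for c in centered])

print(round(r_raw, 3), round(r_centered, 3))
```

For this symmetric artificial distribution the linear and quadratic terms are almost perfectly correlated before centering and uncorrelated afterward. In a skewed empirical distribution, centering reduces the correlation but does not necessarily remove it.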

For  addressing  Research  Question  2,  we  used  a  doubly  latent 
approach  with  cross-level  measurement  invariance  for  the  con¬ 
struct  of  feeling  of  belonging  with  one’s  peers.  This  approach 
comprises  a  twofold  procedure  using  latent  variables,  (a)  latent 
measurement models at both levels and (b) latent aggregation for
the classroom level construct (e.g., Lüdtke, Marsh, Robitzsch, &
Trautwein,  2011;  Marsh  et  al.,  2012).  An  important  advantage  of 
a  model  with  these  features  is  that  it  corrects  possible  measurement 
and  sampling  errors  associated  with  designs  in  which  variables 
measured  at  the  individual  level  are  used  to  operationalize  a  construct 
at  the  classroom  level.  We  focused  in  our  analyses  on  feeling  of 
belonging  with  one’s  peers  at  the  classroom  level.  However,  we 
conducted  an  additional  cross-level  interaction  analysis  in  a  random 
slope  multilevel  structural  equation  model  for  Research  Question  2a 
on  the  association  between  individual  ethnic  minority  versus  majority 
status  and  the  relationship  between  the  proportion  of  minority  students 
and  feeling  of  belonging  with  one’s  peers. 

Missing  data.  Students  in  the  sampled  schools  were  obliged 
to  participate  in  the  achievement  tests.  Yet,  individual  students 
could  be  excluded  from  the  study  by  the  school  if  they  met  one  of 
the  following  three  criteria:  (a)  students  with  permanent  physical 
impairment  that  made  it  impossible  to  participate,  (b)  severe  in¬ 
tellectual  or  emotional  impairment,  and  (c)  students  who  were  less 
than  1  year  in  Germany  and  could  neither  speak  nor  read  in 
German.  The  response  rate  of  the  student  questionnaire  in  the  total 
sample  was  87.3%;  that  is,  it  was  lower  than  the  response  rate  of 
achievement  tests  of  98.3%  because  participation  was  not  manda¬ 
tory  in  some  federal  states  of  Germany.  The  rate  of  the  question¬ 
naire  varied  between  76%  in  Hamburg  and  98%  in  Hesse.  The 
response  rate  for  the  parents’  questionnaires  was  81.4%. 

We  excluded  special  needs  schools  (N  =  51  schools),  schools 
from the former GDR (N = 398 schools), and classrooms with a
large  number  of  missing  values  concerning  ethnic  background 
information  (>50%;  N  =  17  schools)  from  the  analyses.  This  led 
to  a  sample  size  of  19,457  students  in  908  schools.  The  remaining 
sample  did  not  include  any  missing  values  on  the  ethnic  back¬ 
ground  variable  because  we  classified  missing  information  as 
unidentifiable  and  kept  the  student  in  the  data  set.  Information  on 
student  gender  was  missing  for  0.64%  of  the  sample,  on  cognitive 
abilities  for  6.91%,  on  mathematics  achievement  for  4.75%,  on 
reading  achievement  for  4.77%,  on  SES  for  29.62%,  and  on  all 
four  belonging  items  for  19.27%  of  the  sample.  To  deal  with  item 
nonresponse,  we  used  the  full  information  maximum  likelihood 
(FIML)  estimator  implemented  in  Mplus  for  all  variables  except 
student  gender.  This  estimator  applies  a  model-based  approach  to 
missing  data  (see  Enders,  2010),  using  all  information  available  from 
the  model  variables  to  estimate  the  model  parameters.  In  doing  so,  we 
were  able  to  use  96.43%  of  the  intended  sample.  A  total  of  3.57%  of 
the  students  had  either  missing  gender  information  or  missing  values 
on  all  estimated  variables  (SES,  cognitive  abilities,  and  outcome 
variables)  and  was  excluded  during  the  analyses. 

Results 

Preliminary  Analyses 

Comparison  and  selection  of  diversity  measures.  First,  we 
computed  the  composition  and  diversity  measures  (see  “Classroom 
level independent and background variables” and Table S.1). Inspection
of bivariate Pearson correlations at the classroom level
(see Table S.2 available as online supplemental material) showed
that the measures of ethnic diversity are highly correlated with
each other and that they are also highly correlated with the commonly
used “majority-minority approach”; that is, the proportion
of minority students in a classroom (first column in Table S.2).

To  gain  further  insight  into  the  relationship  between  the  pro¬ 
portion  of  minority  students  in  a  classroom  and  the  diversity 
measures,  we  plotted  the  proportions  of  ethnic  minority  students 
for each classroom and the values for Simpson’s D as an example
(Figure 1). The two measures are directly dependent on one
another: In classrooms with very low proportions of minority
students and consequently very high proportions of majority students,
ethnic diversity is lower. In classrooms with 50% or fewer
minority students, the correlation between the proportion
of  minority  students  and  Simpson’s  D  is  r  =  .99  (p  <  .01),  and  in 
classrooms  with  a  proportion  of  minority  students  greater  than  50%  it 
is  r  =  .33  (p  <  .01).  That  is,  classrooms  with  high  proportions  of 
minority  students  vary  in  their  ethnic  heterogeneity.  The  correlation 
pattern  for  other  diversity  measures  was  comparable. 

Because  the  high  intercorrelations  between  the  proportion  of 
minority  students  in  a  classroom  and  the  various  measures  of 
diversity  may  cause  problems  of  multicollinearity  and  make  it 
difficult  to  disentangle  associations  of  variables  with  these  two 
classroom  characteristics,  we  decided  to  calculate  the  diversity 
measures without counting the majority students (that is, German
students) as a category. These measures thus
depict the diversity within the group of ethnic minority students.
Both  measures  together,  the  proportion  of  minority  students  and 
such  a  diversity  index,  describe  the  overall  diversity  of  students  in 
a  classroom.  This  approach  shows  similar  results  to  analyses  that 
are  based  on  a  reduced  sample  of  classrooms  with  a  proportion  of 
minority  students  greater  than  50%  but  simultaneously  allows  us  to 
use  the  complete  sample  with  its  greater  power.  The  intercorrela¬ 
tions  among  the  adapted  measures  as  well  as  their  bivariate  cor¬ 
relation  with  the  independent  and  outcome  variables  can  be  found 
in  Table  S.3  in  the  online  supplemental  material.  These  analyses 
indicate  that  the  correlation  patterns  for  the  various  measures  of 
ethnic  diversity  are  very  similar  to  each  other.  For  reasons  of 
parsimony,  we  only  show  the  main  analyses  for  three  exemplary 


Figure  1.  Joint  distribution  of  the  proportion  of  ethnic  minority  students 
in  a  classroom  and  Simpson’s  D  for  N  =  903  classrooms.  (Students  with 
only  German  background,  i.e.,  majority  students,  were  counted  as  one 
ethnic  group  when  calculating  Simpson’s  D). 


measures  of  diversity  in  the  result  section  of  this  paper  (for  results 
of  further  analyses  with  the  remaining  measures,  see  Tables  S.7  to 
S.9,  available  in  the  online  supplemental  material). 
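
One possible reading of the adapted measures is that the relative frequencies are computed among the minority students only, so that the majority group no longer enters the index. The sketch below illustrates this reading for Simpson’s D using the two fictitious classrooms from the Method section. It is an assumption about the computation, not a documented procedure; the function names and the value returned for a classroom without minority students are choices made for this example.

```python
from collections import Counter


def simpson_d(backgrounds):
    """Simpson's D from one ethnic-background label per student."""
    n = len(backgrounds)
    counts = Counter(backgrounds)
    return 1 - sum((count / n) ** 2 for count in counts.values())


def simpson_d_minority_only(backgrounds, majority="German"):
    """Simpson's D computed among minority students only (assumed reading)."""
    minority = [b for b in backgrounds if b != majority]
    if not minority:
        # Assumption: no minority students means no minority diversity.
        return 0.0
    return simpson_d(minority)


classroom_a = ["German"] * 6 + ["Turkish"] * 13 + ["Polish"] * 1
classroom_b = (["German"] * 6 + ["Turkish"] * 5 + ["former USSR"] * 4
               + ["former Yugoslavia"] * 4 + ["other"] * 1)

d_a = simpson_d_minority_only(classroom_a)
d_b = simpson_d_minority_only(classroom_b)
print(round(d_a, 3), round(d_b, 3))
```

Because both classrooms have the same proportion of minority students, any difference between the two adapted values reflects only how the minority students are distributed across groups, which is what separates this index from the proportion itself.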

Measurement model fit and variance of outcome variables
between classrooms. For the feeling of belonging with one’s
classmates as an outcome, we first explored the fit of the doubly
latent model with cross-level measurement invariance, which
showed acceptable model fit (χ² = 379.680, df = 7, p < .05, root
mean square error of approximation [RMSEA] = .058, Comparative
Fit Index [CFI] = .951, standardized root mean square
residual [SRMR] $\mathrm{SRMR}_{\mathrm{within}}$ = .037, $\mathrm{SRMR}_{\mathrm{between}}$ = .041). Such an
unconditioned  model  without  any  predictors  provides  informa¬ 
tion  on  the  amount  of  variance  at  both  levels  necessary  to 
compute  the  intraclass  correlation  (ICC;  variance  at  individual 
level  =  0.31,  SE  =  0.01;  variance  at  classroom  level  =  0.03, 
SE  =  0.003;  both  variances  significantly  different  from  zero). 
The  ICC  estimates  the  proportion  of  the  total  variance  that  can 
be  attributed  to  differences  between  classrooms.  If  there  was  no 
variation  in  the  feeling  of  belonging  with  one’s  classmates 
between  classrooms,  multilevel  analyses  would  not  be  mean¬ 
ingful.  In  our  study  the  proportion  of  variance  between  class¬ 
rooms  was  9%.  As  is  commonly  the  case  for  noncognitive 
variables,  this  proportion  is  lower  than  the  variation  typically 
found  for  achievement  between  classrooms  (see  Trautwein, 
Lüdtke, Marsh, Köller, & Baumert, 2006). This was also true
for  the  current  analyses — for  mathematics  achievement,  the 
proportion  of  variance  between  classrooms  was  22%  and  for 
reading  achievement  it  was  20%. 
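
The intraclass correlation for the feeling of belonging follows directly from the two variance components reported above. The short sketch below reproduces this arithmetic; the function name is chosen for illustration only.

```python
def intraclass_correlation(var_within, var_between):
    """Share of the total variance located between classrooms."""
    return var_between / (var_within + var_between)


# Variance components of the unconditioned model for feeling of belonging.
icc_belonging = intraclass_correlation(var_within=0.31, var_between=0.03)
print(f"{icc_belonging:.1%}")
```

The result, 0.03 / (0.31 + 0.03), is slightly below 9%, which matches the rounded proportion of variance between classrooms reported in the text.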

Relationship  Between  the  Proportion  of  Ethnic 
Minority  Students,  Ethnic  Diversity,  and  Individual 
Student  Outcomes 

Proportion  of  ethnic  minority  students  in  a  classroom  and 
student  achievement.  The  first  set  of  our  multilevel  analyses 
explored  the  relationship  between  measures  of  the  ethnic  makeup 
of  classrooms  and  students’  individual  achievement  in  mathemat¬ 
ics  and  reading  comprehension.  The  result  pattern  of  the  class¬ 
room  level  variables  is  shown  in  Table  3  for  mathematics  and  in 
Table  4  for  reading  comprehension  as  an  outcome  (for  individual 
student-level results, see Table S.4 and S.5 in the online supplemental
material). In a first step (Model M.1 in Table 3 and Model
R.1 in Table 4, see Research Question 1a), we investigated the
association between the proportion of ethnic minority students and
achievement controlling for individual students’ ethnic background,
gender, cognitive abilities, and SES. The results show
that, independent of individual background characteristics, a
student in a classroom with a one standard deviation higher proportion
of ethnic minority students reached mathematics scores
that were on average 0.17 SDs lower than those of a student in a
classroom with the correspondingly lower proportion of ethnic minority
students (Model M.1 in Table 3). Furthermore, the significant
regression weight of the quadratic
term of the proportion of ethnic minority students in Model M.1b
indicates a slightly inverse U-shaped relationship with a tendency
of lower mathematics scores in classrooms with very low or very
high proportions of ethnic minority students. The same result
pattern emerged for reading comprehension as an outcome (see
Model R.1 and R.1b in Table 4).


Table 3

Results of Multilevel Structural Equation Models Predicting Mathematics Achievement (Classroom-Level Results)

Note. Ncat = number of categories (i.e., ethnicities); SimD = Simpson’s D; ShaH = Shannon’s H; SES = socioeconomic status; L1 = student level; L2 = classroom level. “Diversity measure”
shows coefficients of the respective measure indicated in the first line. Covariates at the student level: ethnic background, SES, cognitive abilities, and gender. Regression coefficients were standardized
by the total variance (within + between) of the outcome variable. For student-level results, see Table S.5, available as online supplemental material.

*p < .05. **p < .01.

Ethnic diversity in a classroom and student achievement.
The next models (M.2 to M.4 in Table 3 and R.2 to R.4 in Table
4) show results for selected diversity measures each as single
predictor at the classroom level (see Research Question 1b). The
first line of the result tables indicates the respective diversity
measure used (Ncat = number of ethnicities in the classroom,
SimD = Simpson’s D, ShaH = Shannon’s H; see Table S.7 to S.9
in  the  online  supplemental  material  for  analyses  with  further 
diversity  measures).  Their  coefficients  are  slightly  smaller  in  size 
than  the  coefficient  for  the  proportion  of  minority  students  in  the 
classroom,  but  they  also  show  a  negative  relationship  with  indi¬ 
vidual  mathematics  and  reading  achievement.4  That  is,  in  ethni¬ 
cally  more  diverse  classrooms,  students  show  slightly  lower  levels 
of  mathematics  and  reading  achievement  scores  if  associations 
with  further  classroom  background  characteristics  are  not  con¬ 
trolled. 

Relationship between the proportion of ethnic minority students,
ethnic diversity, and student achievement. When the
proportion of ethnic minority students in a classroom and a diversity
measure are analyzed simultaneously as predictors at the
classroom level (Models M.5 to M.7 in Table 3 and Models R.5 to
R.7 in Table 4; see Research Question 1c), and when the
socioeconomic composition and cognitive abilities level are also
controlled for (Models M.8 to M.10 in Table 3 and Models R.8 to R.10
in Table 4), the coefficients for the proportion of ethnic minority
students remain negative and significant for mathematics achievement
and reading comprehension as outcomes. However, the associations between
ethnic diversity in the classroom and student achievement
differ from the single-predictor models: After taking into account
the proportion of minority students and the classroom composition
with regard to cognitive abilities and SES, respectively,
ethnic diversity was slightly positively related to individual
mathematics achievement (Models M.5 to M.10 in Table 3).5 The
identical analyses using reading achievement as an outcome (Models
R.5 to R.10 in Table 4) showed no significant association with
ethnic diversity. In all analyses, the proportion of explained
variance did not differ much between models using various measures
of ethnic diversity.

Relationship between the proportion of ethnic minority students,
ethnic diversity, and students’ feeling of belonging with
their peers. Analogous to the models presented so far, Table 5
shows the results for analyses exploring students’ feeling of
belonging with their peers as an outcome. The models using the
proportion of ethnic minority students or a diversity measure as a
single predictor at the classroom level, controlling for individual
student characteristics, showed negative relationships with
students’ feeling of belonging with their peers (Models B.1 to B.4;
see Research Questions 2a and 2b). The strongest predictor was the
proportion of ethnic minority students in a classroom. Its
nonsignificant quadratic term showed that the relationship between the
proportion of ethnic minority students and feeling of belonging
was linear (Model B.1b).

Partly in line with assumptions mentioned regarding Research
Question 2a, additional cross-level interaction analyses (Model
B.1c in Table 5) revealed that individual ethnic majority students
felt a stronger sense of belonging with their peers in classrooms
with a higher proportion of ethnic majority students (interaction
term: β = .23, SE = .08, p < .01). For ethnic minority students the
expected pattern did not emerge. Minority students on average
showed a weaker feeling of belonging with their peers than
majority students did, and their feeling of belonging with their
classmates did not vary substantially depending on the proportion of
minority students. In Models B.2 to B.4, the regression coefficients
of the diversity measures were rather small and reached statistical
significance only in some models. However, these measures
represent diversity within the group of ethnic minority students.
Models using diversity measures including German students as
one category (not presented in Table 5) partly showed stronger
associations (for Ncat: β = -.04, SE = .01, p < .01; for SimD:
β = -.07, SE = .01, p < .01; for ShaH: β = -.07, SE = .01, p <
.01). When the proportion of ethnic minority students and a diversity
measure were analyzed simultaneously as predictors at the classroom
level (Models B.5 to B.7; see Research Question 2c) and classroom-level
covariates were also included (Models B.8 to B.10), the
coefficients of the proportion of ethnic minority students remained
negative and significant, and ethnic diversity within the group of
minority students was not significantly related to students’ feeling
of belonging with their peers. In a last step, we explored the
relationship between the ethnic makeup of classrooms, feeling of
belonging with one’s classmates, and student achievement in a full
model (see Table S.10 for mathematics and Table S.11 for reading
comprehension as an outcome). Students’ feeling of belonging
with their classmates was positively associated with achievement
outcomes (for instance, Model M.8b: βmathematics = .11, SE = .02,
p < .01; Model R.8b: βreading = .07, SE = .02, p < .01), and there
was a slight indirect relationship between the proportion of
minority students, feeling of belonging, and achievement outcomes
(for instance, Model M.8b: βmathematics = -.01, SE = .00, p < .01;
Model R.8b: βreading = -.01, SE = .00, p = .01).

Discussion 

The present study investigated the relationship between various
measures of ethnic composition and heterogeneity or diversity
used in different disciplines on the one hand and achievement and
psychosocial student outcomes on the other hand. The aim was to
explore whether measures of ethnic diversity are related to student
outcomes over and above commonly investigated characteristics of
classroom composition. We sought to shed light on the question of
whether the proportion of minority students is sufficient to
investigate associations between different outcomes and the ethnic
makeup of classrooms in terms of diversity or whether additional
measures are necessary.

First,  we  therefore  collected  detailed  information  on  possible 
diversity  measures  from  research  conducted  in  disciplines  such  as 
communication,  geography,  and  biology.  Our  preliminary  analyses 
comparing  these  measures  operationalizing  the  ethnic  makeup  of 


4 These indices represent diversity among minority students (see section
“Comparison and selection of diversity measures”). When the indices are
calculated including German students as one group, the coefficients for
mathematics achievement as an outcome are: Ncat: β = -.07, SE = .01,
p < .01; SimD: β = -.14, SE = .01, p < .01; ShaH: β = -.13, SE = .01,
p < .01; and for reading achievement as an outcome they are: Ncat: β =
-.08, SE = .01, p < .01; SimD: β = -.13, SE = .01, p < .01; ShaH: β =
-.12, SE = .01, p < .01.

5 The predictors at the classroom level were not highly interrelated; thus,
multicollinearity was not a concern (variance inflation factor [VIF] for
proportion of minority students = 1.53, for Simpson’s D = 1.38, for average prior
achievement = 1.32, and for average SES = 1.27). For correlation tables, see
online supplemental Table S.3.
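
To make the VIF values in Footnote 5 more tangible, the sketch below converts each one into the share of a predictor's variance explained by the other classroom-level predictors, using the standard identity VIF = 1 / (1 - R²). The function name is hypothetical, and the R² values are derived here for illustration; they are not reported by the authors.

```python
def implied_r_squared(vif):
    """Return the R^2 of a predictor regressed on the other predictors.

    Uses the standard identity VIF = 1 / (1 - R^2), i.e. R^2 = 1 - 1 / VIF.
    """
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# VIF values as reported in Footnote 5.
reported_vifs = {
    "proportion of minority students": 1.53,
    "Simpson's D": 1.38,
    "average prior achievement": 1.32,
    "average SES": 1.27,
}

shared_variance = {name: implied_r_squared(v) for name, v in reported_vifs.items()}
for name, r2 in shared_variance.items():
    print(f"{name}: VIF = {reported_vifs[name]:.2f}, implied R^2 = {r2:.2f}")
```

Even the largest value implies that only about a third of the variance in the proportion of minority students is shared with the other classroom-level predictors, which is consistent with the footnote's reading that multicollinearity was not a concern.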


Table  5 

Results  of  Multilevel  Structural  Equation  Models  Predicting  Feeling  of  Belonging  With  One’s  Peers  (Classroom  Level  Results) 
classrooms led to the conclusion that they are highly
intercorrelated (see McDonald & Dimmick, 2003), which is why we
considered a selection of diversity measures within our main analyses.
Overall, using different diversity measures as independent variables
predicting student outcomes led to comparable results.
Furthermore, the proportion of minority students (which is mostly used
in educational research to describe the ethnic makeup) was highly
correlated with diversity measures of the complete student
population in a classroom. Thus, even if diversity and proportion of
minority students are not the same from a content perspective, they
should lead to comparable associations with student outcomes. In
our study, we adapted the diversity measures to represent diversity
only among minority students.

Our main analyses showed that students reached lower
achievement scores in classrooms with a higher proportion of ethnic
minority students (Hypothesis 1a). These findings are in line with
international research (Mickelson et al., 2013; Van Ewijk &
Sleegers, 2010a). Possible mediating and additional influential
factors (which were not the focus of the present study) that may
induce such findings are instructional quality, motivational
processes among peers, non-German language spoken with peers, and
school resources such as organizational structures and teacher
competencies (Agirdag et al., 2012; Palardy, 2015; Raudenbush et
al., 1998; Stipek, 2004; Van Ewijk & Sleegers, 2010a). Similarly,
students reached lower achievement scores in ethnically more
diverse classrooms. This result is in line with our hypothesis (see
Hypothesis 1b, cf. Byrnes & Miller-Cotto, 2016) but contradicts
studies showing an advantage for academic achievement
development in ethnically diverse classrooms or schools (e.g., Tam &
Bassett, 2004). However, when the proportion of ethnic minority
students and a diversity measure were analyzed as joint
predictors, controlling for average SES and average cognitive
abilities in the classroom, we found different patterns of results:
After accounting for differences in the proportion of minority
students, ethnic diversity in the classroom was not significantly
related to reading achievement, but did show a weak positive
association with students’ level of mathematics achievement.
The findings are partly in line with assumptions of advantaged
achievement development in ethnically diverse classrooms
(Benner & Crosnoe, 2011; Gurin et al., 2003; Tam & Bassett,
2004). Thus, a positive association with mathematics
achievement became visible only after controlling for the proportion of
minority students and the level of socioeconomic status and
cognitive abilities at the classroom level. We assume that
controlling for these classroom characteristics also controlled for the
less favorable learning environment and cumulated
disadvantages in classrooms that are associated with a higher proportion
of ethnic minority students.

The question arises why this pattern of slightly positive
associations between diversity and achievement outcomes (which cannot be
interpreted as causal relationships) emerged only for mathematics
achievement. The reliability of the reading comprehension measure
used was lower than that of the mathematics measure, which might
influence the findings. It is possible that the theoretically assumed
benefits of ethnic classroom diversity develop easily for mathematics
achievement because it is strongly tied to instruction (Crosnoe et al.,
2010). As an alternative, it could be the case that a positive effect of
ethnic diversity emerges for a large number of achievement outcomes
but did not for reading comprehension because reading was tested in
German and ethnic minority students might not speak German at
home. In addition, we can assume that in ethnically diverse
classrooms there is a larger number of different language backgrounds
present, which might be related to different kinds of student difficulties
and also different cultural background knowledge needed to
understand texts. In such classrooms, it might be more difficult for the
teacher to react to all students’ needs during language instruction,
exacerbating positive diversity effects. Future studies should
explore additional explanations for positive associations between
mathematics achievement and diversity. For instance, it is possible
that diverse classrooms are also characterized by further beneficial
features, such as school resources in terms of learning materials,
teacher competencies and cultural understanding, organizational
structure, and pedagogical concepts (cf. Eccles & Roeser, 2011).
In the end, the current results do not allow the overall conclusion
that ethnic diversity per se is beneficial for math learning. The
weak positive associations found do not allow a causal
interpretation in our cross-sectional design. They could emerge as a result
of the former mechanisms but also because of unobserved
characteristics of the classrooms in the present study, or they could be
related to the amount of missing ethnic background information
and a classification as students with “unidentifiable” ethnic
background. A replication of the result pattern in countries with another
ethnic composition is needed.

Our multilevel analyses taking students’ feeling of belonging
with their peers as an outcome revealed that students felt less
related to their classmates in classrooms with a high proportion of
ethnic minority students and in ethnically more diverse
classrooms. Additional analyses showed that individual majority
students felt a stronger sense of belonging in classrooms with higher
proportions of majority students, while minority students on
average reported a weaker sense of belonging with their peers
independent of the proportion of minority students. This finding was
partly in line with the belongingness perspective predicting a higher
sense of belonging in more homogeneous groups (Baumeister &
Leary, 1995; Benner & Crosnoe, 2011; Byrne, 1971; Tajfel &
Turner, 1986). It is interesting that the diversity measures were not
more strongly related to the feeling of belonging than the
proportion of ethnic minority students was. Studies analyzing ethnic
composition and belonging commonly argue that students build
their belonging based on their specific ethnic background (see
Benner & Crosnoe, 2011) rather than on the broader distinction
between ethnic minority and majority. The present study initially
indicates that overall minority versus majority group membership
may be influential for students’ feeling of belonging with their
peers. Because the feeling of belonging with one’s peers is an
important psychosocial outcome of schooling that is positively
related to motivational student characteristics (cf. Goodenow,
1993; Kumar & Maehr, 2010), classroom characteristics that are
prone to foster a feeling of belonging should be further addressed
in future research. A further interesting question that goes beyond
the scope of the main research questions of this study is whether
the feeling of belonging with one’s peers partially mediates effects
of the ethnic makeup of classrooms on achievement outcomes (cf.
Christenson et al., 2012). First supplementary analyses of these
relationships in full models indicate a positive association between
students’ feeling of belonging with their peers and achievement
outcomes, as well as a slight indirect association between the
proportion of minority students, feeling of belonging, and
achievement outcomes. However, in cross-sectional designs it
is impossible to determine the direction of these relationships
and possible reversed effects (e.g., achievement affects feeling
of belonging).

In conclusion, the current findings indicate that using the
proportion of minority students or a diversity index as a measure
of the ethnic makeup of classrooms mostly led to comparable
conclusions. Including a diversity index in addition to the
proportion of minority students showed additional weak
associations with mathematics achievement and no significant
relationship with reading achievement or feeling of belonging
with one’s peers.

Limitations  and  Future  Research 

The present study has six important limitations. First, we
analyzed data from a cross-sectional design, which renders it
impossible to make statements about the origins and further development
of the classroom effects, including the positive associations found
between ethnic diversity and mathematics achievement. One
important background characteristic of classrooms that determines
future student achievement is the average prior achievement in a
classroom. We included it as a covariate using cognitive ability
scores as a proxy. These test scores were collected at the same time
point as the outcome variables and may therefore lead to a bias
underestimating compositional effects (see Duncan, Magnuson, &
Ludwig, 2004). At the same time, it is possible that our analyses
overestimated classroom-level associations, as the proportion of
minority students in a classroom is also confounded, for
instance, with less favorable residential environments and
segregated areas in large cities (for a methodological discussion of
compositional effects, see Harker & Tymms, 2004; Hauser,
1970). Effects of the ethnic composition usually are smaller and
often lack statistical significance when controlling for prior
achievement level and SES level in German studies (Dumont,
Neumann, Maaz, & Trautwein, 2013). Future research thus
should favor longitudinal designs and include a large range of
context characteristics.

Second, we were lacking background information on the
country of origin for some students and created the category
“unidentifiable” to include them in the analyses in order to get a full
picture of the complete classroom. This may have distorted the
estimates of classroom-level associations, over- or
underestimating associations between outcome variables and
the ethnic makeup, because these students are treated as one
category alongside other ethnic categories and we lack information
on how homogeneous this group is. However, excluding all students with
missing background information does not seem to be an option because it
might underestimate the variety. Future research should obtain
this kind of background information, for instance from school
reports available for every student, to avoid missing data.
Furthermore, similar studies in other countries with a different
ethnic composition than Germany are needed to disentangle
effects of diversity and specific group characteristics within a
system to a larger degree.

Third, we only analyzed relationships between the ethnic
makeup of classrooms and students’ feeling of belonging with
their classmates, as well as their achievement in mathematics
and reading. Including a variety of achievement measures that
are more likely to be related to the benefits of exposure to
diversity, such as creative thinking and problem solving (see
Gurin et al., 2003), could be a useful addition in future studies.

Fourth, we do not know how the students in our study perceived
ethnic diversity in their classroom and whether their perception mattered
for their attachment to classmates and group formation (for
examples of perceived team diversity in organizational psychology, see
Shemla, Meyer, Greer, & Jehn, 2014). Future research should
involve students’ point of view to a larger degree.

Fifth, there may be further student background characteristics,
such as SES and prior achievement, jointly constituting
diversity in addition to the ethnic background. Recent
developments in organizational psychology investigating the alignment
of multiple diversity attributes and creating a hypothetical
dividing line between homogeneous groups (“faultline”; see
Thatcher & Patel, 2012) could be a model for future educational
research as well.

Finally, our study captures only one aspect of ethnic diversity,
namely the distribution of students in a classroom according to their
families’ country of birth, expressed in a number that quantifies the
degree of diversity. Future studies should explore additional
operationalizations of and means to investigate diversity, for instance
including information on the similarity between different ethnic
groups, investigating latent characteristics of subgroups and
interaction effects between individual and group characteristics, and
exploring means to identify mechanisms behind diversity effects. We
are aware that ethnicity and ethnic identity go far beyond these
numbers and would like to encourage more quantitative and
qualitative research in this domain based on a diversity of methods. The
aim of the present study was to explore measures of the ethnic
makeup of classrooms in large-scale assessment frameworks.
Thus, it provides a basis for future research that is concerned with
recommendations on school and classroom composition and ways
to address it.

Conclusion 

The findings indicate that the ethnic makeup of classrooms
matters for individual achievement and psychosocial outcomes.
The ethnic diversity measures collected for this educational
research study proved to be closely intertwined. Thus, using one
measure or the other should lead to comparable results. The
proportion of ethnic minority students showed the strongest
relation with student outcomes, but ethnic diversity revealed slightly
different result patterns for some outcomes. While the proportion
of ethnic minority students in a classroom was negatively
associated with individual student outcomes, ethnic diversity was
positively related to mathematics achievement after controlling for
relevant classroom background characteristics associated with less
favorable learning environments. However, conclusions should
be drawn cautiously. The slightly positive relationship between
diversity and mathematics achievement needs replication in
future research, as it can be influenced, for instance, by
unobserved characteristics of participating student groups. Future
research in the field of education should not ignore diversity
measures completely. Depending on the research question,
subgroup and school subject, diversity measures may give us more
insight into how the ethnic makeup of the classroom is related
to student outcomes.
Call for Papers

A  Focused  Collection  of  Qualitative  Studies  in  the  Psychological  Sciences: 
Reasoning  and  Participation  in  Formal  and  Informal  Learning  Environments 

Journal  of  Educational  Psychology 

Guest  Editors:  Tanner  LeBaron  Wallace  and  Eric  Kuo 

Reasoning  and  participation  are  two  central  topics  of  education  research  in  the  psychological 
sciences.  Understanding  the  mechanisms  that  govern  thought  and  reasoning  has  long  been  a  core 
enterprise of educational psychology and, over time, more modern views on learning have promoted
participation  as  a  key  feature  for  research — either  as  a  facilitator  of  learning,  a  practice  to  be 
learned,  or  as  an  operationalization  of  learning  itself. 

We  are  pleased  to  announce  a  focused  collection  highlighting  qualitative  studies  of  reasoning  and 
participation  in  formal  and  informal  learning  environments.  By  inviting  studies  incorporating 
qualitative  methods,  we  aim  to  complement  the  experimental  and  longitudinal  statistical  research  on 
these  topics  that  is  typically  published  in  this  journal.  We  encourage  submission  of  papers  focused 
on  the  following  (or  closely  related)  topics: 

•  Student  reasoning  and/or  participation  in  novel  learning  environments  or  activities 

•  The  relations  between  student  reasoning,  motivation,  identity,  and  participation 

•  Student  perceptions  and  meaning-making  during  participatory  experiences 

•  Dynamic  models  of  student  reasoning  that  are  grounded  in  data 

•  Explanatory  accounts  for  how  and  why  participation  is  successful  (or  not) 

•  Identifying  new  goals  or  targeted  outcomes  for  reasoning  or  participation 

We  especially  welcome  qualitative  studies  that  demonstrate  the  possibilities  for  unique  discovery 
afforded  by  inductive  analysis  of  rich  data  sources  (e.g.,  real-time  recordings  of  student  reasoning, 
participation,  discourse,  and  physical  action,  students’  meaning-making  anchored  to  particular 
interactions  experienced).  This  collection  will  highlight  the  benefits  of  qualitative  methods  for 
extending  and  deepening  theoretical  and  empirical  understandings  of  reasoning  and  participation  in 
both  formal  and  informal  learning  environments. 

The  deadline  for  manuscript  submissions  is  March  1,  2018.  We  invite  authors  to  contact  the  Guest 
Editors  of  this  collection,  Tanner  LeBaron  Wallace  (twallace@pitt.edu)  and  Eric  Kuo 
(erickuo@pitt.edu),  for  discussion  on  how  to  maximize  alignment  between  their  submissions  and 
this  focused  collection,  though  it  is  not  required.  Please  follow  both  APA  guidelines  as  well  as 
specific  submission  criteria  for  the  journal.  When  submitting  manuscripts,  please  also  indicate  your 
intent  to  submit  to  this  focused  collection  in  the  required  cover  letter. 

All manuscripts must be submitted electronically at http://www.editorialmanager.com/edu. In the
submission portal, please select the article type “Special Section: Reasoning & Participation -
Qualitative.” For more information on the Journal of Educational Psychology, please visit
http://www.apa.org/pubs/journals/edu/.
The main purpose of the Journal of Educational Psychology is to
publish original, primary psychological research pertaining to
education across all ages and educational levels. A secondary
purpose of the Journal is the occasional publication of exceptionally
important theoretical and review articles that are pertinent to
educational psychology.

Manuscript preparation. Authors should prepare manuscripts
according to the Publication Manual of the American Psychological
Association (6th ed.). Manuscripts may be copyedited for
bias-free language (see pp. 70-77 of the Publication Manual).
Formatting instructions (all copy must be double-spaced) and
instructions on the preparation of tables, figures, references, metrics,
and abstracts appear in the Manual. For APA’s Checklist for Manuscript
Submission, see www.apa.org/pubs/journals/edu.

Abstract and keywords. All manuscripts must include an abstract containing a
maximum of 250 words typed on a separate page. After the
abstract, please supply up to five keywords or brief phrases.

References. References should be listed in alphabetical order.
Each listed reference should be cited in text, and each text citation
should be listed in the References. Basic formats are as follows:

Hughes, G., Desantis, A., & Waszak, F. (2013). Mechanisms of
intentional binding and sensory attenuation: The role of
temporal prediction, temporal control, identity prediction, and
motor prediction. Psychological Bulletin, 139, 133-151.
http://dx.doi.org/10.1037/a0028566

Rogers, T. T., & McClelland, J. L. (2004). Semantic cognition:
A parallel distributed processing approach. Cambridge, MA:
MIT Press.

Gill, M. J., & Sypher, B. D. (2009). Workplace incivility and
organizational trust. In P. Lutgen-Sandvik & B. D. Sypher
(Eds.), Destructive organizational communication: Processes,
consequences, and constructive ways of organizing (pp. 53-73).
New York, NY: Taylor & Francis.

Adequate description of participants is critical to the science and practice
of educational psychology; this allows readers to assess the results,
determine generalizability of findings, and make comparisons in
replications, extensions, literature reviews, or secondary data analyses. Authors
should see guidelines for sample-subject description in the Manual.
Appropriate indexes of effect size or strength of relationship should be
incorporated in the results section of the manuscript (see p. 34 of the
Manual). Information that allows the reader to assess not only the
significance but also the magnitude of the observed effects or relationships
clarifies the importance of the findings.

Figures. Graphics files are
welcome if supplied in TIFF or EPS format. APA’s policy on publication
of color figures is available at
http://www.apa.org/pubs/authors/instructions.aspx?item=6.

Publication policies. APA policy prohibits an author from submitting
the same manuscript for concurrent consideration by two or more
publications. APA policy regarding posting articles on the Internet may be
found at www.apa.org/pubs/authors/posting.aspx. In addition, it is a
violation of APA Ethical Principles to publish “as original data, data that
have been previously published” (Standard 8.13). As this is a primary
journal that publishes original material only, APA policy prohibits
publication of any manuscript or data that have already been published in
whole or substantial part elsewhere. Authors have an obligation to consult
journal editors concerning prior publication of any data on which their
article depends. In addition, APA Ethical Principles specify that “after
research results are published, psychologists do not withhold the data
on which their conclusions are based from other competent
professionals who seek to verify the substantive claims through reanalysis
and who intend to use such data only for that purpose, provided that
the confidentiality of the participants can be protected and unless legal
rights concerning proprietary data preclude their release”
(Standard 8.14). Authors must have available their data throughout the
editorial review process and for at least 5 years after the date of
publication.

Masked review policy. The Journal has a masked review policy,
which means that the identities of both authors and reviewers are masked.
Every effort should be made by the authors to see that the manuscript
itself contains no clues to their identities. Authors should never use first
person (I, my, we, our) when referring to a study conducted by the
author(s) or when doing so reveals the authors’ identities, e.g., “in our
previous work, Johnson et al., 1998 reported that . . . .” Instead, references
to the authors’ work should be in third person, e.g., “Johnson et al. (1998)
reported that . . . .” The authors’ institutional affiliations should also be
masked in the manuscript. Authors submitting manuscripts are required
to include in the cover letter the title of the manuscript along with all
authors’ names and institutional affiliations. However, the first page of the
manuscript should omit the authors’ names and affiliations, but should
include the title of the manuscript and the date it is submitted.
Responsibility for masking the manuscript rests with the authors; manuscripts
will be returned to the author if not appropriately masked. If the
manuscript is accepted, authors will be asked to make changes in wording so
that the paper is no longer masked. Authors are required to state in writing
that they have complied with APA ethical standards in the treatment of
their sample, or to describe the details of treatment. A copy of the APA
Ethical Principles may be obtained at www.apa.org/ethics/ or by writing
the APA Ethics Office, 750 First Street, NE, Washington, DC 20002-4242.
APA requires authors to reveal any possible conflict of interest in
the conduct and reporting of research (e.g., financial interests in a test
procedure, funding by pharmaceutical companies for drug research).
Authors of accepted manuscripts will be required to transfer copyright
to APA.

Permissions.  Authors  of  accepted  papers  must  obtain  and  provide  to 
the  editor  on  final  acceptance  all  necessary  permissions  to  reproduce  in 
print  and  electronic  form  any  copyrighted  work,  including  test  materials 
(or  portions  thereof),  photographs,  and  other  graphic  images  (including 
those  used  as  stimuli  in  experiments).  On  advice  of  counsel,  APA  may 
decline  to  publish  any  image  whose  copyright  status  is  unknown. 

Supplemental  materials.  APA  can  place  supplementary  materials 
online, which will be available via the published article in the
PsycARTICLES database. To submit such materials, please see
www.apa.org/pubs/authors/supp-material.aspx  for  details.  Authors  of 
accepted  papers  will  be  asked  to  work  with  the  editor  and  production 
staff  to  provide  supplementary  materials  as  appropriate. 

Submission. Authors should submit their manuscripts electronically
via the Manuscript Submission Portal at
www.apa.org/pubs/journals/edu/index.aspx (follow the link for submission
under Instructions to Authors). General correspondence may be addressed
to the editorial office at CJohnson@apa.org.